Abstract



This paper analyzes the cache miss cost of algorithms when scheduled using randomized 
work stealing (RWS) in a parallel environment, taking into account the effects of false sharing. 

First, prior analyses [1] are extended to incorporate false sharing. However, to control the
possible delays due to false sharing, some restrictions on the algorithms seem necessary.
Accordingly, the class of Hierarchical Tree algorithms is introduced and their performance analyzed.

In addition, the paper analyzes the performance of a subclass of the Hierarchical Tree
Algorithms, called HBP algorithms, when scheduled using RWS; improved complexity bounds are
obtained for this subclass. This class was introduced in [6, 7] with efficient resource oblivious
computation in mind.



Finally, we note that in a scenario in which there is no false sharing the results in this paper
match prior bounds for cache misses but with reduced assumptions, and in particular with
no need for a bounding concave function for the cost of cache misses as in [11]. This allows
non-trivial cache miss bounds in this case to be obtained for a larger class of algorithms.



1 Introduction 



Work-stealing is a longstanding technique for distributing work among a collection of processors
[5, 13, 17]. Work-stealing operates by organizing work in tasks, with each processor managing its
currently assigned tasks. Whenever a processor p becomes idle, it selects another processor q and is
given (it steals) some of q's available tasks. A natural way of selecting q is for p to choose it uniformly
at random from among the other processors. To emphasize the inherent randomization in this
process, we call it randomized work stealing, RWS for short. RWS has been widely implemented,
including in Cilk [3], Intel TBB [12] and KAAPI [12]. This methodology is continuing to increase
in importance due to its applicability to multicore computers in which each processor (or core) has
a private cache.

RWS scheduling has been analyzed and shown to provide provably good parallel speed-up for
a fairly general class of algorithms [2]. Its cache overhead was considered in [1], which gave some
general bounds on this overhead; these bounds were improved in [11] for a class of computations
whose cache complexity function can be bounded by a concave function of the operation count.
The bound in [1] when applied to processing a list of tasks was recently improved in [TB] . 
These analyses assume that there is no false sharing. False sharing can occur as a result of
employing cache coherency protocols, when data is moved to and from cache in blocks (cache 
lines) comprising multiple words. False sharing refers to two different processors seeking to access 
distinct locations in the same block, and if one or both seeks to perform a write, only one of them 
can access the block at a time. This reduces the possible parallelism and potentially increases 
algorithm runtime. 

In this paper, we analyze the efficiency of algorithms when scheduled using RWS, taking account
of delays due to false sharing. As in prior work, there are two parts to the analysis: bounding the
number of steals, and bounding the additional costs of stolen tasks, which depend in part on the
number of steals. Accounting for false sharing (or more generally, block misses) affects both parts.

To achieve high efficiency, we will need algorithms that organize their writes in a way that
minimizes interaction among the different processors and hence among the tasks forming the algorithm.
We will characterize a class of such algorithms, termed Hierarchical Tree Algorithms, and analyze 
their performance. 

In addition, we give a more refined analysis for a subclass of these algorithms, the Hierarchically
Balanced Parallel (HBP) algorithms, which were introduced in [6, 7]. This class includes standard
divide and conquer algorithms such as matrix multiply (used as a running example in this paper), 
FFT, a new sorting algorithm [3 [8], and some list and graph algorithms [6]. 

2 Computation Model 

We model a computation using a directed acyclic graph, or dag, D (good overviews can be found 
in OH]). D is restricted to being a series-parallel graph, where each node in the graph corresponds 
to a size $O(1)$ computation. Recall that a directed series-parallel graph has start and terminal nodes.
It is either a single node, or it is created from two series-parallel graphs, $G_1$ and $G_2$, by one of:

i. Sequencing, where the terminal node of $G_1$ is connected to the start node of $G_2$.

ii. A parallel construct, which introduces a new start node s and a new terminal node t. s is
connected to the start nodes for $G_1$ and $G_2$, and their terminal nodes are connected to t.

D supports multithreaded computation by enabling two threads to continue from each node s in 
(ii) above; these threads then recombine into a single thread at the corresponding node t. This 
multithreading corresponds to a fork-join in a parallel programming language. 

We will be considering algorithms expressed in terms of tasks, a simple task being a size $O(1)$
computation, and more complex tasks being built either by sequencing, or by forking, often
expressed as recursive subproblems that can be executed in parallel. Such algorithms map to
series-parallel computation dags.
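The recursive definition of a series-parallel dag can be illustrated with a minimal Python sketch. The class names Node, Seq and Par and the functions work and span are hypothetical names introduced here only for illustration; span counts the nodes on a longest path, which corresponds to the quantity later denoted $T_\infty$, under the assumption that every node has unit cost.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Node:
    """A single node: one size-O(1) computation."""


@dataclass(frozen=True)
class Seq:
    """Sequencing: the terminal node of `first` feeds the start node of `second`."""
    first: "SP"
    second: "SP"


@dataclass(frozen=True)
class Par:
    """Parallel construct: a new fork node s and a new join node t around two dags."""
    left: "SP"
    right: "SP"


SP = Union[Node, Seq, Par]


def work(g: SP) -> int:
    """Total number of nodes in the dag (unit cost per node is an assumption)."""
    if isinstance(g, Node):
        return 1
    if isinstance(g, Seq):
        return work(g.first) + work(g.second)
    # The fork node s and the join node t each count as one node.
    return 2 + work(g.left) + work(g.right)


def span(g: SP) -> int:
    """Number of nodes on a longest start-to-terminal path."""
    if isinstance(g, Node):
        return 1
    if isinstance(g, Seq):
        return span(g.first) + span(g.second)
    return 2 + max(span(g.left), span(g.right))
```

For instance, a single node followed by a fork whose branches contain one and two nodes respectively is written Seq(Node(), Par(Node(), Seq(Node(), Node()))).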

In RWS, each processor maintains a work queue, on which it stores tasks that can be stolen.
When a processor C generates a new stealable task it adds it to the bottom of its queue. If C
completes its current task, it retrieves the task $\tau$ from the bottom of its queue, and begins executing
$\tau$. The steals, however, are taken from the top of the queue.

An idle processor C' will pick a processor C'' uniformly at random and independently of other
steals, and attempt to steal from the top of the task queue of C''. If the steal fails (either because
the task queue is empty, or because some other processor was attempting the same steal, and
succeeded) then processor C' continues trying to steal until it succeeds.
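The queue discipline just described can be sketched in a few lines of Python. This is a sequential illustration under simplifying assumptions, not a concurrent implementation: the names Processor, push, pop_bottom, steal_top and steal_until_success are hypothetical, contention between simultaneous thieves is not modelled, and the max_attempts cap exists only so that the sketch always terminates.

```python
import random
from collections import deque


class Processor:
    """One processor with its work queue of stealable tasks."""

    def __init__(self, pid):
        self.pid = pid
        # Left end of the deque is the top (steals); right end is the bottom (owner).
        self.queue = deque()

    def push(self, task):
        """The owner adds a newly generated stealable task to the bottom."""
        self.queue.append(task)

    def pop_bottom(self):
        """The owner retrieves its next task from the bottom, or None if empty."""
        return self.queue.pop() if self.queue else None

    def steal_top(self):
        """A thief takes the task at the top, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


def steal_until_success(thief, processors, rng, max_attempts=10_000):
    """Pick victims uniformly at random until a steal succeeds.

    Returns (task, number_of_attempts); task is None if the cap is reached.
    """
    others = [q for q in processors if q is not thief]
    for attempt in range(1, max_attempts + 1):
        victim = rng.choice(others)
        task = victim.steal_top()
        if task is not None:
            return task, attempt
    return None, max_attempts
```

Stealing from the top while the owner works at the bottom means that a thief takes the oldest, and in a recursive computation typically the largest, available task.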
We consider a computing environment comprising a collection of p processors, each equipped
with a local memory or cache of size M. There is also a common shared memory of unbounded 
size. Data is transferred between the shared and local memories in size B blocks (or cache lines). 

We are mainly interested in algorithms that are optimal both in terms of their cache miss cost
and their work, and in the maximum parallelism that can be achieved given these desiderata. In
this paper, we delineate constraints on the algorithms that enable cache efficiency and show how 
to analyze the cache and block miss costs of such algorithms. As we will see, these constraints are 
not onerous: they are observed by a variety of standard parallel algorithms, modulo at most small 
changes. 

2.1 Cache and Block Misses 

We distinguish between two types of cache-related costs, as discussed in [6]. 

The term cache miss denotes a read of a block from shared memory into processor C's cache,
when a needed data item is not currently in the cache, either because the block was never read
by processor C, or because it was evicted from C's cache to make room for new data. This is
the standard type of cache miss that occurs, and is accounted for, in sequential cache complexity
analysis.

The term block miss denotes an update by a processor C' $\ne$ C to an entry in a block $\beta$ that
is in processor C's cache. This results in block $\beta$ being invalidated, and results in processor C
needing to read in block $\beta$ the next time it accesses data in this block. This is done so that data
consistency is maintained within the elements of a block across all copies in caches at all times. 
This type of 'cache miss' does not occur in a sequential computation and seems challenging to 
bound. We will refer to this type of caching cost as a block miss; this includes the cost of 'false 
sharing'. There are other ways of dealing with block misses, but we believe that the block miss cost 
with our invalidation rule is likely as high as (or higher than) that incurred by other mechanisms. 
Thus, our upper bounds should hold for most of the coping mechanisms known for handling block 
misses. 
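To make the invalidation rule concrete, the following minimal Python sketch counts the block reads each processor performs under that rule. It is an illustration only: the function name count_misses, the format of the access tuples, and the assumption of unbounded caches (so that no block is ever evicted for lack of space) are introduced here and are not part of the analysis.

```python
def count_misses(accesses, block_size):
    """Count block reads per processor under a write-invalidate rule.

    accesses: list of (processor, word_address, is_write) in execution order.
    block_size: number of words per block (B in the text).
    Returns a dict mapping each processor to its number of block reads.
    """
    cached = {}  # processor -> set of block ids currently valid in its cache
    misses = {}  # processor -> number of block reads so far
    for proc, addr, is_write in accesses:
        block = addr // block_size
        blocks = cached.setdefault(proc, set())
        misses.setdefault(proc, 0)
        if block not in blocks:
            # Either a first read (cache miss) or a re-read after invalidation.
            misses[proc] += 1
            blocks.add(block)
        if is_write:
            # A write invalidates every other processor's copy of the block.
            for other, other_blocks in cached.items():
                if other != proc:
                    other_blocks.discard(block)
    return misses


# Two processors write distinct words 0 and 1, which share one block of 8 words.
shared = [(p, p, True) for _ in range(10) for p in (0, 1)]
# The same pattern with the two words placed in different blocks.
separate = [(p, 8 * p, True) for _ in range(10) for p in (0, 1)]
```

In the first trace every access re-reads the block even though the two processors never touch the same word, which is the false sharing effect; in the second trace each processor reads its block once.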

Work stealing causes the algorithm execution to incur additional cache misses and introduces 
block misses. Our analysis does not make any assumptions about the mechanism used for handling 
accesses to a shared block and in particular does not assume it is fair. This can result in block 
misses being unboundedly expensive. Instead, we use algorithmic techniques to control these costs. 

2.2 The Main Results 

To achieve good bounds, we consider algorithms that exhibit good data locality. Techniques
enabling this have been developed for cache aware and cache oblivious algorithms; we will introduce
additional methods to help control the cost of block misses. In particular, we characterize a class of
recursive algorithms we call Hierarchical Tree Algorithms, which, in addition, satisfy the following
properties:

i. Data locality in the writes. 

ii. Limited access: each writable variable is written $O(1)$ times.

iii. A constraint on space usage, which we call top-dominance.

iv. A sufficient shrinkage in the size of recursive subproblems. 

These properties are also used in our companion paper [6] in the analysis of its deterministic 
PWS scheduler. 
At this point, a definition of task size will be helpful.

Definition 2.1. A task $\tau$ is said to have size $r$, denoted $|\tau|$, if it accesses $r$ distinct (one-word)
variables over the course of its execution.

We use the following parameters to specify our results. Let A be an algorithm described by
a series-parallel dag D. Suppose that on an input of size n, in the worst case, in a sequential
execution, A performs W operations and incurs Q cache misses. Suppose that an operation on
in-cache data takes $O(1)$ time units, that the cost of a cache miss is $O(b)$ time units, the cost for
stealing a task is $\Theta(s)$ time units, and the cost for an unsuccessful steal is $O(s)$, allowing for the
possibility that an unsuccessful steal has less cost than a successful one. Let $T_\infty$ be the maximum
length of the paths descending the dag D representing A's computation, let E be a bound on the
cost, measured in cache misses, incurred in performing reads and writes at any single node of D,
and let $D_b$ be a bound on the cost, measured in cache misses, incurred in performing reads and
writes on any single path in any execution of D. Clearly, $D_b \le E \cdot T_\infty$.

The analysis has four parts: 

1. A bound on the cost of cache misses as a function of the number S of stolen tasks. We obtain 
the same bounds as Frigo and Strumpen [11], but with a more direct analysis that avoids the need 
for an assumption that the cost of cache misses is bounded by a concave function of the work 
performed by a subtask. Rather, we use an assumption that the cost of cache misses is bounded 
by a non-decreasing function of the size of a task. 

We illustrate this methodology by example, using two standard algorithms for matrix multiply. 
The first algorithm is the depth $\log^2 n$ recursive algorithm that calls 8 subproblems in parallel.
The second algorithm is the depth n in-place algorithm, which calls two sets of 4 subproblems in 
sequence; this algorithm does not observe the limited access property (from (ii) above); we make a 
small modification, which ensures the property holds (but the algorithm is no longer in-place). 

This methodology can handle algorithms not covered by the method in [11], which appears
to mainly handle in-place recursive algorithms. By contrast, many of the algorithms the new
analysis is applied to use recursive procedures that also call a distinct second recursive procedure
(e.g. a matrix addition inside a matrix multiply).

Since the bit-interleaved (BI) format is more cache-efficient than the Row Major (RM)
format for matrix algorithms, we analyze both matrix multiply algorithms assuming the BI format.
Accordingly, we also describe algorithms for converting between these two formats. The natural
recursive algorithm for converting RM to BI is optimal both in terms of its operation and its cache
and block miss cost; however, this is not true for the natural algorithm for converting BI to RM,
which potentially has a high cost due to block misses. Instead, in Section 4.3 we will describe a
slower algorithm for performing this conversion, but one that is more efficient in terms of its block
misses. The costs of this algorithm are dominated by those for the matrix multiply algorithms
we consider, and consequently, the complexities of these matrix multiply algorithms are the same
whether their inputs and outputs are in RM or BI format. Another, yet better conversion algorithm
from [6] is mentioned briefly in Section 7.

2. A bound on the additional cost due to block misses. This will be $O(B)$ per stolen task for the
class of algorithms we consider. This will follow from the above properties: Properties (ii)-(iv)
allow us to show an $O(B)$ bound on the delay due to block misses in accessing any one shared
block. Property (i) is concerned with limiting the number of shared blocks per task; it is an $O(1)$
bound for the algorithms we consider and is an individual algorithm design issue.
3. A bound on the number of successful steals, which applies to any computation described by a
series-parallel computation dag, and not just those covered by part (2) above, so long as there is
a bound E on the cost of cache and block misses as specified above. This uses an analysis similar
to that of Acar et al. [1], but will depend in part on the cost of the block misses, and also loosens
the assumption regarding the cost of an unsuccessful steal by allowing an unsuccessful steal to cost
less than a successful one. We have:

With probability $1 - 2^{-a T_\infty}$, for $a = \omega(1)$, a series-parallel computation A has at most the following
number of successful steals:

$$O\left(p \cdot \left(T_\infty + \frac{b}{s} E\, T_\infty\right)(1 + a)\right).$$

For the class of algorithms we consider, by (2), $E = O(B)$. For this class of algorithms, $D_b = O(E\, T_\infty)$,
but it is not clear whether this is a tight bound.

4. An improved bound on the number of successful steals for the case of HBP algorithms defined in
[6]. These algorithms require forked subproblems to be of roughly equal size. They form a subclass
of the Hierarchical Tree Algorithms. For this class of algorithms, the impact of the block misses
on the number of steals can be bounded more sharply. We improve the bound in (3), replacing the
term $\frac{b}{s} E\, T_\infty$ by $\frac{b}{s} D_b$. This is significant, for we are able to obtain and use tighter bounds on
$D_b$ than the earlier $O(E\, T_\infty)$. In particular, for a computation described by an $m$-leaf forking tree, $D_b$
reduces from $O(B \log m)$ to $O(\min\{B, m\} + \log m)$. Corresponding improvements are obtained for
the $D_b$ terms for more complex algorithms. This result is shown in Theorems 6.2 and 6.3.

Finally, in Section [3 we derive complexity bounds for several other HBP algorithms. 

3 Bounding the Cache Misses 

Our method for bounding the cache misses determines bounds on the worst case number of tasks 
of a given size that can be stolen in any execution. For stolen tasks of size 2M or larger, up to 
constant factors, the same cache miss costs would be incurred even if there were no steal. For 
smaller tasks, the incurred costs are a function of the task size, which combined with the bounds 
on the number of tasks of a given size, yields bounds on the cache miss costs as a function of the 
number of stolen tasks. 

We illustrate this method using two algorithms for matrix multiply (MM). We assume that the
matrices are in the bit-interleaved (BI) layout, which recursively places the elements in the
top-left quadrant, followed by the recursive placement of the top-right, bottom-left, and bottom-right
quadrants. This layout is well known to be effective in minimizing data movement [10]. In
Section 4.3 we discuss algorithms for converting between the row major (RM) and BI layouts.
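The quadrant ordering of the BI layout can be made concrete with a short Python sketch. The function name bi_index is hypothetical, and the sketch assumes that n is a power of 2 and that elements are indexed from 0; it computes positions only and is not one of the conversion algorithms discussed in the paper.

```python
def bi_index(row, col, n):
    """Position of element (row, col) of an n x n matrix in the BI layout.

    Assumes n is a power of 2. Quadrants are ordered top-left, top-right,
    bottom-left, bottom-right, and the same rule is applied recursively.
    """
    index = 0
    half = n // 2
    while half >= 1:
        # 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
        quadrant = 2 * int(row >= half) + int(col >= half)
        index = 4 * index + quadrant
        # Continue with coordinates relative to the chosen quadrant.
        row %= half
        col %= half
        half //= 2
    return index
```

With this ordering every aligned square submatrix whose side is a power of 2 occupies a contiguous range of positions, which is the locality property the recursive MM algorithms rely on.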



Depth n MM. The algorithm we consider, from [9], recursively multiplies an initial four pairs
of $n/2 \times n/2$ matrices, recording the results, and then multiplies a second collection of four pairs,
adding the second set of results to the results from the initial multiplications. It has $W = O(n^3)$,
$Q = O(n^3/(B\sqrt{M}))$, and $T_\infty = O(n)$.

We now demonstrate the same cache miss bound as obtained by Frigo and Strumpen [11]. But
this is a slightly different algorithm, as Frigo and Strumpen have the matrices in RM format (we
need the BI format to control the block miss costs, as we will see later).
Lemma 3.1. The depth n MM algorithm incurs $O(n^3/(B\sqrt{M}) + S^{1/3} n^2/B + S)$ cache misses when
it undergoes S steals.

Proof. We begin by bounding the cache miss cost for the initial task and the stolen tasks of size
2M or larger; we will show that collectively they incur $O(n^3/(B\sqrt{M}))$ cache misses. For simplicity,
we assume that n and M are integer powers of 2. Ignoring the work removed by smaller stolen
subtasks, these tasks each solve one or more distinct matrix multiply subproblems of size 2M. Each
such subproblem incurs $O(M/B)$ cache misses, and there are $O(n^3/M^{3/2})$ of them. This yields a
total of $O(n^3/(B\sqrt{M}))$ cache misses incurred by these tasks.

For smaller stolen tasks $\tau$, the cache miss cost is $O(\lceil |\tau|/B \rceil)$.

In fact, the thread for a stolen task can exhibit an expanding size as it proceeds. For example,
a steal could be of, say, a $1 \times 1$ MM subproblem, which because it is the last subproblem to complete
at the join, continues to execute the remaining half of a $2 \times 2$ MM subproblem (comprising the
remaining four $1 \times 1$ subproblems), which is followed by the remaining half of a $4 \times 4$ MM subproblem,
etc. However, for this example, and indeed for every steal, the data needed for the last (and largest)
remaining half MM subproblem computed, $\tau$ say, contains the data needed for all the earlier
subproblems the steal computes, and therefore if this last subproblem is of size $M$ or smaller, the
steal incurs $O(\lceil |\tau|/B \rceil)$ cache misses. Consequently, the prior analysis continues to apply.

It follows that the worst case bounds arise assuming the small stolen tasks are all as large as
possible. For each integer $i \ge 0$, there are $O(8^i (n/\sqrt{M})^3)$ stolen subtasks of size $M/4^i$. They cause
$O((2^i M/B + 8^i) \cdot (n/\sqrt{M})^3)$ cache misses. Summing over all $i$, given that there are $S$ stolen tasks in
all, yields that $S = \Theta(8^i (n/\sqrt{M})^3)$ for the largest index $i$ involved, and hence gives a total of
$O(S^{1/3} n^2/B + S)$ cache misses.



We will return to this bound later when we have obtained bounds on S. 
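The summation at the end of the proof of Lemma 3.1 can be spelled out as follows. This is a sketch under the assumption that the stolen tasks fill the size classes from the largest downwards; the symbol $i^*$ is introduced here only for illustration and denotes the index of the smallest size class that is used.

```latex
\begin{align*}
S &= \Theta\!\left(8^{i^*}\left(\frac{n}{\sqrt{M}}\right)^{3}\right)
  \;\Longrightarrow\;
  2^{i^*} = \Theta\!\left(\frac{S^{1/3}\sqrt{M}}{n}\right),\\[4pt]
\sum_{i=0}^{i^*}\left(2^{i}\,\frac{M}{B} + 8^{i}\right)\left(\frac{n}{\sqrt{M}}\right)^{3}
  &= O\!\left(2^{i^*}\cdot\frac{M}{B}\cdot\frac{n^{3}}{M^{3/2}} + S\right)\\
  &= O\!\left(\frac{S^{1/3}\sqrt{M}}{n}\cdot\frac{n^{3}}{B\sqrt{M}} + S\right)
   = O\!\left(\frac{S^{1/3}\,n^{2}}{B} + S\right).
\end{align*}
```

Both sums are geometric, so each is dominated by its last term; the $8^i$ terms add up to $\Theta(S)$, which accounts for the additive $S$.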

The standard way of implementing this algorithm is as an in-place process. However, this
violates the limited access property (for each output array location is written $n$ times). We make
the algorithm limited access as follows: for each recursive subproblem, we create a local array to
store the results of its subproblems, which are then added together and written to the array for
the parent subproblem. This increases the number of operations by a factor of 2, but this can
be reduced to, for example, less than 1% additional operations by having the base case comprise
$10 \times 10$ matrices. The local arrays also increase the space usage to $O(n^2 \log p)$. This appears to
be a necessary part of our method for controlling block misses.
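The limited access modification can be sketched as follows in Python. This is a sequential illustration, not the parallel algorithm that is analyzed: the function names split, join, add and mm_limited_access are hypothetical, matrices are plain lists of rows rather than arrays in BI layout, the base case is $1 \times 1$, and n is assumed to be a power of 2. The point of the sketch is only that each result location is written once, into a freshly created local array, instead of being updated in place.

```python
def split(X):
    """Return the four quadrants (top-left, top-right, bottom-left, bottom-right)."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def join(q11, q12, q21, q22):
    """Assemble four quadrants into one matrix."""
    top = [a + b for a, b in zip(q11, q12)]
    bottom = [a + b for a, b in zip(q21, q22)]
    return top + bottom


def add(X, Y):
    """Entrywise sum, written to a new array."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mm_limited_access(A, B):
    """Depth-n style MM: two collections of four subproblems, then one addition."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    mm = mm_limited_access
    # First collection of four subproblems, stored in a local array.
    first = join(mm(A11, B11), mm(A11, B12), mm(A21, B11), mm(A21, B12))
    # Second collection, stored in a second local array instead of being
    # accumulated into the first one.
    second = join(mm(A12, B21), mm(A12, B22), mm(A22, B21), mm(A22, B22))
    # The two local arrays are added and the sum is written once for the parent.
    return add(first, second)
```

The extra addition at every level is the source of the additional operations and of the additional space discussed above.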

Finally, we note that the cache miss analysis of Frigo and Strumpen cannot be immediately
applied to this variant of MM. The reason is that there are subtasks which are matrix additions,
and they have a larger cache miss to work ratio than the matrix multiply tasks. By contrast, the
cache miss to size ratio is only reduced, and this immediately yields the bound of Lemma 3.1 for
this variant of the algorithm.

Corollary 3.1. The limited access version of the depth n MM algorithm incurs
$O(n^3/(B\sqrt{M}) + S^{1/3} n^2/B + S)$ cache misses when it undergoes S steals.

Proof. The number of addition subtasks of a given size, up to constant factors, is no larger than the
number of multiplication subtasks of the same size. Thus the presence of these addition subtasks,
which are no more expensive in cache misses than the multiplication subtasks, does not alter the
prior bound on cache misses. ■



Depth $\log^2 n$ MM. This algorithm multiplies two $n \times n$ matrices by recursively multiplying eight
pairs of $n/2 \times n/2$ matrices, followed by a tree computation to add pairs of these recursively multiplied
matrices. This algorithm has $T_\infty = O(\log^2 n)$, $W = O(n^3)$, and $Q = O(n^3/(B\sqrt{M}))$.
The exact same analysis applies to this algorithm, yielding:

Corollary 3.2. The depth $\log^2 n$ MM algorithm incurs $O(n^3/(B\sqrt{M}) + S^{1/3} n^2/B + S)$ cache misses
when it undergoes S steals.

The difference, as we will see, is that this algorithm incurs far fewer steals than the depth n 
algorithm. 

Space Usage. The depth $\log^2 n$ MM algorithm uses space $O(p^{1/3} n^2)$, which is larger than the
$O(n^2 \log p)$ space usage of the limited access depth $n$ MM algorithm, which in turn is larger than
the in-place $O(n^2)$ space use of the depth $n$ algorithm, but for this latter algorithm it is not clear
whether there are good bounds on the block delay costs.

4 Bounding Block Misses 

The delay caused by different processors writing into the same block can be quite significant, and 
this is a caching delay that is present only in the parallel context. These costs might arise if two 
processors are sharing a block (which occurs for example if data partitioning does not match block 
boundaries) or if many processors access a single block (which could occur if the processors are all 
executing very small tasks). We refer to any read of a block that is not in cache due to the block 
being shared by multiple processors as a block miss. 

In particular, consider a parallel execution in which two or more processors between them
perform multiple accesses to a block $\beta$, which include $x \ge 1$ writes. These accesses could cause
$O(b \cdot x)$ delay at every processor accessing $\beta$, where $b$ is the delay due to a single cache miss. We
will measure this delay in units of size $b$, the bound on the cost of a cache miss. Consequently,
henceforth, we will refer to this as a $\Theta(x)$ block delay, which we associate both with the block and
the processors accessing the block.

Further, $x$ can be arbitrarily large unless care is taken in the algorithm design. We establish
that $x = O(B)$ when our recursive algorithms use limited access variables (Property 4.1) and are
exactly linear space bounded (a special case of Property 4.2), and the runtime system observes a
natural space allocation property (Property 4.3). Of course, it is only when there are one or more
writes to a block that there can be any block misses.

Definition 4.1. Suppose that block $\beta$ is moved $m$ times from one cache to another (due to cache or
block misses) during a time interval $T = [t_1, t_2]$. Then $m$ is defined to be the block delay incurred
by $\beta$ during $T$.

The block wait cost incurred by a task $\tau$ on a block $\beta$ is the delay incurred during the execution
of $\tau$ due to block misses when accessing $\beta$, measured in units of cache misses.

If a task $\tau$ executes during time interval $T_\tau$ and accesses block $\beta$ during its execution, then
clearly the block wait cost incurred by $\tau$ on $\beta$ is no more than the block delay of $\beta$ during $T_\tau$.

In order to analyze the block miss costs, we need to explain how the program variables are
stored. Let $\tau$ be either the original task in the computation of A or a stolen subtask. When a task
$\tau$ is initiated, an execution stack $S_\tau$ is created to keep track of the procedure calls and variables
used in $\tau$'s execution. While this execution stack is created by the processor C that starts $\tau$'s
execution, if another processor C' takes over $\tau$'s execution, by being the second (and hence last)
processor to finish the work preceding a join, then C' will continue using $S_\tau$ for the remainder of
$\tau$'s computation (at least up until yet another processor takes over $\tau$'s computation). The variables
on $S_\tau$ may be accessed by stolen subtasks also. As $S_\tau$ grows and shrinks, with each growth period
corresponding to the creation of new variables, these new variables may be stored in reused portions
of a block $\beta$, and this may happen repeatedly. In Lemma 4.4 we show a bound of $Y(|\tau|, B)$, for a
suitable function $Y$, on the block delay $\tau$ faces in accessing a block $\beta$ on $S_\tau$. For the algorithms we
consider, $Y(|\tau|, B) = \min\{|\tau|, B\}$.

In addition to the variables stored on the execution stacks, the algorithm needs variables in 
which to store its output. These output variables, which may be arrays, are stored in memory 
locations separate from those used for the execution stacks, and share no blocks with the execution 
stacks. 

At this point, it will be helpful to define the notions of local and global variables with respect 
to a procedure P of an algorithm. 

Definition 4.2. A variable x declared in a procedure P is called a local variable of P. A variable 
y accessed by P and declared in a procedure Q calling P or used for the inputs or outputs of the 
algorithm A containing P is said to be global with respect to P. However, note that y would be a 
local variable of Q if declared in Q. 

4.1 Algorithmic Constraints 

The following lemma will motivate our definition of limited access algorithms. 

Lemma 4.1. Let $\tau$ be a task whose execution is initiated by processor C, and let $\beta$ be a block
provided to C to store local variables of $\tau$. Let $T$ be the time interval during which $\tau$ is executed,
and let $T'$ be a subinterval of $T$ (possibly $T' = T$). Suppose that processors $C_1, \ldots, C_k$ are the only
processors executing stolen subtasks of $\tau$ during $T'$. Further suppose that they access block $\beta$ a total
of $x$ times during $T'$. Then $\beta$ incurs a block delay of at most $2x$ during $T'$.

Proof. Processors $C_1, \ldots, C_k$ cause at most $x$ moves of block $\beta$ to their caches as a result of their $x$
accesses. Thus processor C needs at most $x$ moves of block $\beta$ to its cache to handle all its accesses,
regardless of their number. ■
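The counting argument in this proof can be checked on small examples with the following Python sketch. It uses a deliberately simplified model that is an assumption of the sketch rather than of the paper: the block resides in exactly one cache at a time, it starts in the cache of the owner C, and every access by a processor that does not currently hold the block moves it. The function name block_moves is hypothetical.

```python
def block_moves(accessors, owner="C"):
    """Count the moves of a single block between caches.

    accessors: sequence of processor names, one per access, in time order.
    owner: the processor whose cache initially holds the block.
    """
    holder = owner
    moves = 0
    for proc in accessors:
        if proc != holder:
            # The accessing processor must first bring the block into its cache.
            moves += 1
            holder = proc
    return moves


# Owner C interleaved with x = 2 accesses by thieves C1 and C2.
example = ["C", "C1", "C", "C", "C2", "C", "C", "C"]
```

In the example the block moves to C1, back to C, to C2 and back to C, for a total of $4 = 2x$ moves, however many accesses C itself makes in between.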

We begin by specifying the notion of an access bound for a task. 

Definition 4.3. Let $\tau$ be a task and $\beta$ a block on the execution stack of the processor C executing
$\tau$. Let $\tau_1, \ldots, \tau_k$ be the subtasks of $\tau$ stolen from C. $\tau$ is defined to have access bound $s$ if
$\tau_1, \ldots, \tau_k$ access $\beta$ at most $s$ times, for every such block $\beta$.

Lemma 4.2 is an immediate consequence of Lemma 4.1.

Lemma 4.2. If stolen subtask $\tau$ is $s$ access bounded, then each block $\beta$ on $S_\tau$ incurs an $O(s)$ block
delay.

We now identify a class of algorithms for which the access bound for their stolen subtasks $\tau$ is
$O(\min\{|\tau|, B\})$.

Our first constraint places a limit on how often each writable variable can be accessed. 

Property 4.1. An algorithm is limited-access if each of its writable variables is accessed $O(1)$
times.

Due to procedure calls, including recursive ones, over time more than B variables could all
share a single block, and so this property does not suffice to yield an $O(B)$ bound on the number
of accesses to a single block.

Our second constraint imposes upper and lower bounds on the space used by the tasks in our 
algorithms. To help specify this we create the following hierarchy of algorithms. 

Definition 4.4. A Tree Algorithm A is formed from the down-pass of a binary forking computation 
tree T followed by its up-pass, and satisfies the following additional properties. 

1. Each leaf node performs $O(1)$ computation. Each non-leaf node in the down-pass performs
only $O(1)$ computation before it forks its two children. Likewise, each non-leaf node in the up-pass
performs only $O(1)$ computation after the completion of its forked subtasks.

2. Each node declares at most $O(1)$ local variables. In addition, A may use size $O(|T|)$ global
arrays to store its output. If A is being used as a subroutine, these arrays are declared by the
calling procedure; otherwise, they are the output arrays for the algorithm.

Note that if A is a subroutine, then the global array for its output is stored on the execution
stack of the procedure that calls A, while if it is the full algorithm, its global array is the algorithm
output, which is stored separately from any execution stack.

Thus, on its execution stack, the task for a Tree Algorithm with computation tree T will use
space proportional to the height of T. Typically, in an efficient parallel algorithm, this height is
$O(\log |T|)$. Henceforth, for short, we call such a task a tree task. In addition, the task initiating
the computation will use space $O(|T|)$ for the global arrays on its execution stack.
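As an illustration of the shape of a Tree Algorithm, the following Python sketch sums an array with a binary forking tree. It is a sequential stand-in for the fork-join computation: the function name tree_sum is hypothetical, the two recursive calls play the role of the forked children, and the returned height is included only to show that the recursion depth, and hence the execution stack space, is proportional to the height of the tree. The input is assumed to be non-empty.

```python
def tree_sum(values, lo=0, hi=None):
    """Sum values[lo:hi] with a binary forking tree; returns (total, tree height)."""
    if hi is None:
        hi = len(values)
    if hi - lo == 1:
        # Leaf node: O(1) computation.
        return values[lo], 0
    # Down-pass: O(1) work (computing mid) before forking the two children.
    mid = (lo + hi) // 2
    left_total, left_height = tree_sum(values, lo, mid)
    right_total, right_height = tree_sum(values, mid, hi)
    # Up-pass: O(1) work after both forked subtasks have completed.
    return left_total + right_total, 1 + max(left_height, right_height)
```

Each node uses a constant number of local variables (lo, hi, mid and the two partial results), matching the second condition of Definition 4.4.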

We create more complex algorithms, which we call Hierarchical Tree Algorithms, using
sequencing and recursion.

Definition 4.5. A Hierarchical Tree Algorithm is one of the following: 

1. A Type 0 Algorithm, a sequential computation of constant size.

2. A Type 1, or Tree Algorithm.

3. A Type $i+1$ Hierarchical Tree Algorithm, for $i \ge 1$. An algorithm A is a Type $i+1$ Hierarchical
Tree Algorithm if, on an input of size $n$, it calls, in succession, a sequence of $c \ge 1$ collections
of $v(n) \ge 1$ parallel recursive subproblems, where each subproblem has size $s(n) \le n/b(n)$, with
$b(n) \ge 1 + \nu$ for some constant $\nu > 0$; further, each of these collections can be preceded and/or
followed by $O(1)$ calls to Hierarchical Tree Algorithms of type at most $i$.

Data is transferred to and from the recursive subproblems by means of variables (arrays) declared
at the start of the calling procedure. Note that these arrays are local to the calling procedures, and
global w.r.t. the recursive procedures.

4. A Type $\max\{t_1, t_2\}$ Hierarchical Tree Algorithm results if it is a sequence of two Hierarchical
Tree Algorithms of types $t_1$ and $t_2$.

The recursive forking of $v(n)$ parallel tasks in a Hierarchical Tree Algorithm is incorporated
into the binary forking in our multithreaded set-up by using a fork-join structure identical to that
for the tree algorithms, except that each leaf of this tree corresponds to a recursive subproblem.

4.1.1 Space Constraints 

We begin with two definitions regarding the space usage. 

Definition 4.6. An algorithm A with an input of size $n$ is Exactly $S'(n)$ Space Bounded, also
denoted by $S'(A)$, if it stores its local variables in space of size between $S'(n)$ and $d \cdot S'(n)$, for some
constant $d \ge 1$. If A is a Type $i$ Hierarchical Tree Algorithm, $i \ge 2$, in which each recursive call
has an input of size at most $s(n) \le n/b(n)$, where $b(n) \ge 1 + \nu$ for some constant $\nu > 0$,
then A is defined to have Path Space Bound $S^P(A)$ or $S^P(n) = \sum_{j \ge 0} c^j S'(s^{(j)}(n))$, where $c$ is the
number of collections of recursive calls it makes. If A is obtained by sequencing Hierarchical Tree
Algorithms B and C, then $S^P(A) = \max\{S^P(B), S^P(C)\}$. While if A is a tree computation T, then
$S^P(A) = S^P(n) = \Theta(\mathrm{height}(T))$.

The following Property 4.2 imposes a lower bound on $S'(A)$. This property will be used in the
proof of Lemma 4.4.

Property 4.2. A Type $i+1$ Hierarchical Tree Algorithm A, $i \ge 1$, is top-dominant if for each
type $h$ procedure B it calls, $h \le i$, $S'(A) = \Omega(S^P(B))$, and for $h \ge 2$, B is also top-dominant.

In all the recursive algorithms we consider, $S'(n), S^P(n) = O(n)$, and the tree algorithms have
$S'(n) = O(1)$ and $S^P(n) = O(\log n)$. Thus these algorithms are all top-dominant. We also say such
algorithms are Exactly Linear Space Bounded.

The second space constraint concerns space allocation by the runtime system. 

Property 4.3. (Space Allocation Property.) Whenever a processor requests space it is allocated 
in block sized units; naturally, the allocations to different processors are disjoint and entail no block 
sharing. 

4.2 Analysis of the Cost of Block Misses 

Definition 4.7. The kernel of $\tau$ is the portion of $\tau$ remaining after its stolen subtasks are removed.

We note that the processor $C$ executing $\tau$'s kernel may change after a join of a subtask $\tau''$ executed
by $C$ with a stolen subtask $\tau'$ executed by $C'$, in the event that $C'$ finishes executing $\tau'$ after $C$ finishes
its execution of $\tau''$; then $C'$ will take over the remainder of the execution of $\tau$'s kernel. If there is
such a change, we call it a usurpation by $C'$.

The following observation is readily seen and will be used in the proof of Lemma 4.3.

Observation 4.1. Let $D$ be the series-parallel computation dag for a task $\tau$. Let $v$ be the node in
$D$ corresponding to the last task $\tau_v$ to be stolen during the execution of $\tau$'s kernel, and let $P_\tau$ be the
path in $D$ from the root of $D$ to the parent of $v$. Then, the set of tasks stolen from the processor(s)
executing $\tau$'s kernel consists of some or all of the tasks corresponding to those nodes of $D$ that are
the right child of a node on $P_\tau$ but are not themselves on $P_\tau$. Further, they are stolen in top-down
order with respect to the path $P_\tau$.

Lemma 4.3. Let $A$ be a limited-access Tree Algorithm and let $\tau$ be either the original task in the
computation of $A$ or a task which is stolen during the execution of $A$. Let $\beta$ be a block used for $\tau$'s
execution stack $S_\tau$. Then $\beta$ incurs a block delay of $O(\min\{B, \text{ht}(\tau)\})$ during $\tau$'s execution, where
$\text{ht}(\tau)$ denotes the height of the corresponding computation tree.

Proof. By Observation 4.1, there is a single path $P_\tau$, starting at the root node of $\tau$, such that stolen
subtasks of $\tau$ correspond to off-path right children of $P_\tau$. Each node $v$ of $P_\tau$ has a collection of
$O(1)$ local variables that are stored contiguously on $S_\tau$; we refer to the locations taken by these
variables as the segment for $v$, which we denote by $\sigma_v$. The only segments that can be accessed
by the stolen subtasks of $P_\tau$ are the segments for nodes on $P_\tau$. In addition, these segments occupy
disjoint portions of $S_\tau$. As each of the variables stored on $S_\tau$ is a limited access variable, it follows
that $\beta$ can be accessed $O(\min\{B, \text{ht}(\tau)\})$ times by the stolen subtasks, for, as already noted, $\tau$ uses
$O(\text{ht}(\tau))$ space.

Each time a stolen subtask accesses $\beta$ there may be a need to transfer $\beta$; further, following this,
there may be a need to transfer $\beta$ back to the processor executing $\tau$. This causes $O(\min\{B, \text{ht}(\tau)\})$
transfers of $\beta$. In addition, if the execution of $\tau$ shifts from one processor to another, this may
entail further transfers of $\beta$. We call such a shift a usurpation. It can occur at a join, if the
processor $C$ executing a stolen subtask ending at this join is the last of the two joining tasks to
finish; then $C$ continues the work on $\tau$. But usurpations can only occur on the up-pass of the
computation, and at this point the only further change to $S_\tau$ is to shrink. As the variables on $S_\tau$
are limited access, this means that there can be only $O(\min\{B, \text{ht}(\tau)\})$ usurpations that involve
accesses to $\beta$ and hence further transfers of $\beta$, namely one for each join with a stolen subtask of $\tau$.
This is $O(\min\{B, \text{ht}(\tau)\})$ transfers in total. ■

Remark 4.1. In our companion paper [6], we introduce padded BP algorithms. They are a variant
of BP algorithms (specified in Section 6), which in turn are a variant of limited access Tree
Algorithms. In padded BP algorithms, each node $v$ declares an array of size $\sqrt{r}$, where $r$ is the size
of the subtask which starts at node $v$. These arrays are otherwise unused; their purpose is to reduce
the number of block misses. Then the bound in the above lemma changes to $O(\min\{B, \sqrt{|\tau|}\})$ (for
in padded BP algorithms, the sizes of the nodes are geometrically decreasing as one descends the
tree).

Notation. Let $v$ be the node in $D$ initiating task $\tau$, where $\tau$ is either the original task or a stolen
task. Sometimes, we write $v_\tau$ for $v$. We use both $\sigma_v$ and $\sigma_\tau$ to designate the segment on $S_\tau$ for
node $v_\tau$.

Lemma 4.4. Let $A$ be a limited-access, top-dominant, Type 2 Hierarchical Tree Algorithm. Let $\tau$
be either the original task in the computation of $A$ or a task which is stolen during the execution
of $A$. Let $\beta$ be a block used for $\tau$'s execution stack $S_\tau$. Then the number of transfers of block $\beta$
during the execution of $\tau$ is bounded by



where $\Theta(S(x))$ is a tight bound on the space used by a recursive task of size $x$ for its local variables
in algorithm $A$.

If $S(n) = \Theta(n)$, the bound becomes

$$
\begin{cases}
O(cB) & \text{if } s(|\tau|) \ge B \\
O\left(\sum_{i \ge 0} c^i \cdot s^{(i)}(|\tau|)\right) & \text{otherwise}
\end{cases}
$$

If $s(n) \le (1 - \nu)n/c$ this is an $O(\min\{cB, |\tau|\})$ bound.
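The size of the sum in the second case can be sketched with a geometric series. This is only a sketch: it reads the hypothesis as $s(n) \le (1 - \nu)n/c$ for a constant $0 < \nu < 1$, and assumes $s$ is applied repeatedly to sizes of at most $|\tau|$.

```latex
s^{(i)}(|\tau|) \le \left(\frac{1-\nu}{c}\right)^{i} |\tau|
\quad\Longrightarrow\quad
\sum_{i \ge 0} c^{i}\, s^{(i)}(|\tau|)
  \le |\tau| \sum_{i \ge 0} (1-\nu)^{i}
  = \frac{|\tau|}{\nu}
  = O(|\tau|).
```

Under this reading, the factor $c^i$ from the number of recursive calls at depth $i$ is outweighed by the shrinkage of the subproblem sizes, so the total stays linear in $|\tau|$.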

Proof. As in the proof of Lemma 4.3, by Lemma 4.1, it suffices to bound the number of accesses
by stolen subtasks of $\tau$ to block $\beta$ plus the number of usurpations.

Again, there is a single path $P_\tau$ in the computation dag $D$, such that the only segment accessed
by a stolen subtask $\tau'$ of $\tau$ is the segment on $S_\tau$ corresponding to the parent of $v_{\tau'}$ on $P_\tau$. However,
as the segments for successive nodes on $P_\tau$ may reuse space on $S_\tau$ (if they are present for disjoint
time intervals), conceivably the sum of the lengths of, and hence the number of accesses to, the
portions of these segments in $\beta$ is $\Omega(B)$. It remains to bound the sum of these lengths.

Of the blocks storing portions of $S_\tau$, we focus on those for which the variables stored (as
opposed to their values) may change over the course of $\tau$'s execution. These will be blocks holding
segments (or portions of segments) whose lifetime is shorter than $\tau$'s, i.e. they are for nodes that
are strict descendants of $v_\tau$ on $P_\tau$. For any other block $\beta$ storing portions of $\sigma_\tau$, there can be only
$O(\min\{B, S(|\tau|)\})$ accesses to $\beta$ during $\tau$'s execution, by the limited access property.

Next, we explain the sequence of segments for Type 2 tasks that can be present simultaneously
on $S_\tau \cap \beta$ and that correspond to nodes on $P_\tau$ (we call these Type 2 segments henceforth). There is
the segment $\sigma_\tau$ for $\tau$, followed by a segment $\sigma_{\tau_1}$ for $\tau_1$, where $\tau_1$ is called recursively by $\tau$, followed
by a segment $\sigma_{\tau_2}$ for $\tau_2$, where $\tau_2$ is called recursively by $\tau_1$, and so forth. If $c > 1$, over time there
will be up to $c$ distinct $\tau_1$, one from each collection of recursive calls, up to $c^2$ distinct $\tau_2$, and so
forth. These $c$ tasks $\tau_1$ will reuse the same space on $S_\tau$, as will the $\tau_2$, etc. However, segments present
simultaneously use disjoint space.

Thus, if the segment for each $\tau_1$ uses at least $B$ space, then $\beta$ is overlapped by just $\sigma_\tau$ and
the $\sigma_{\tau_1}$. In this case, the sum of the space used in $\beta$ by each of these Type 2 segments is $O(cB)$.
Otherwise, during such an interval of time, the sum of the space used in $\beta$ by the Type 2 segments
is bounded by $O\left(\sum_{i \ge 0} c^i \cdot [s^{(i)}(|\tau|)]\right)$.

We still need to account for the space used by tree tasks whose segments are on $S_\tau \cap \beta$. We will
bound this by $O(\min\{B, S(|\tau|)\})$ plus a constant times the space used by the Type 2 segments. To
this end, we charge the accesses in $\beta$ to the segment for a tree task $\nu$ to the segment $\sigma_\mu$ for the Type
2 task $\mu$ that called $\nu$. By Property 4.2, this charge is $O(S(\sigma_\mu))$ (this is where top dominance,
the lower bound on $S(|\mu|)$, is used). Since we are concerned with accesses to block $\beta$, this charging
is legitimate only if $\sigma_\mu$ lies fully in $\beta$. This need not be the case for one segment, namely the
segment, if any, which overlaps $\beta$ and its predecessor block $\beta'$ on $S_\tau$; call this segment $\sigma_{\bar\mu}$. Instead
of charging $\sigma_{\bar\mu}$, the charges for the $O(1)$ tree tasks called by $\sigma_{\bar\mu}$ are paid for directly. The argument
is the same as in Lemma 4.3: each tree task $\nu$ called by $\sigma_{\bar\mu}$ uses space $O(\min\{B, \text{ht}(\nu)\})$ on $\beta$; as
$A$ is top-dominant this is $O(S(\sigma_{\bar\mu})) = O(\min\{B, S(|\tau|)\})$. The remaining charged segments are
fully contained in block $\beta$, and so the total cost for accesses to tree segments in $\beta$ is $O(\min\{B, S(|\tau|)\})$
plus (a constant times) the space used by the Type 2 segments.

Again, we need to account for the effect of usurpations. But as in the proof of Lemma 4.3, the
number of usurpations that cause transfers of block $\beta$ is bounded by the number of variables stored
on $\beta$ during the execution of $\tau$, and we have already bounded this quantity. So this does not affect
the overall bound. ■

This result could be extended to higher type recursive algorithms. To prove it for type $i + 1$
algorithms, the tree computations are replaced by type $i$ computations in the above argument,
which is otherwise unchanged.

Remark 4.2. The bound in the above lemma applies even if the tree algorithms are padded BP
algorithms, for $S(|\nu|) = O(\sqrt{|\nu|})$ for a padded BP tree algorithm task $\nu$, and this will be upper
bounded by $S(|\mu|)$ for the Type 2 task $\mu$ calling $\nu$.

In all our Type 2 algorithms, $S(n), S^P(n) = \Theta(n)$.

Remark 4.3. When employing an algorithm $A$ as a subroutine, where $A$ reads its input repeatedly,
assuming $A$'s input was generated by the calling procedure, we need to ensure $A$'s input is only read
$O(1)$ times, at least if the call to $A$ can run in parallel to other work. Our Matrix Multiply algorithms
are examples for which this arises. One way of ensuring this for recursive tasks is for them to use
local variables to copy their input.

4.3 Algorithm Examples 

For the MM algorithms, $S(n^2) = O(n^2)$ (this is the space used to store the results of the MM
subproblem being computed by the task).

The MM algorithms have the feature that for matrices in the BI format, each stolen subtask
writes to $O(1)$ blocks shared with its parent task, and consequently by Lemma 4.4 induces an
additional $O(B)$ delay, measured in cache miss units. The following lemma is immediate.

Lemma 4.5. The MM algorithms incur delay $O(S \cdot B)$ due to the block misses if they undergo $S$
steals.

Indeed, this is a design principle that is followed in the already mentioned oblivious sorting
algorithm [7] and in the algorithms described in our companion paper [6]: ensure that each subtask
accesses only $O(1)$ writable blocks shared with other tasks. (One could generalize this to any
bound $Z$, with a proportionate increase in the overall costs to $O(S \cdot Z \cdot B)$ due to the block misses,
of course.)

We can now explain our algorithms for converting between the RM and BI formats for storing 
matrices. 

To go from RM to BI, the straightforward algorithm which recursively copies each quadrant
using a tree computation suffices; it results in $T_\infty = O(\log n)$, $W = O(n^2)$, $Q = O(n^2/B)$.
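The recursive quadrant copy can be sketched sequentially as follows. This is a minimal illustration, not the paper's implementation: the name `rm_to_bi`, the restriction to $n$ a power of two, and the quadrant order (top-left, top-right, bottom-left, bottom-right) are all assumptions, and the four recursive calls would be forked in a parallel version.

```python
def rm_to_bi(rm, n):
    """Convert an n x n matrix stored in row-major (RM) order in the flat
    list `rm` to a bit-interleaved (BI) layout by recursively copying
    quadrants. n is assumed to be a power of two; the quadrant order
    (top-left, top-right, bottom-left, bottom-right) is an assumption.
    """
    bi = [None] * (n * n)

    def copy(row, col, size, out):
        # Copy the size x size submatrix with top-left corner (row, col)
        # into the contiguous range bi[out : out + size * size].
        if size == 1:
            bi[out] = rm[row * n + col]
            return
        half = size // 2
        quarter = half * half
        # In a parallel implementation these four calls would be forked.
        copy(row, col, half, out)
        copy(row, col + half, half, out + quarter)
        copy(row + half, col, half, out + 2 * quarter)
        copy(row + half, col + half, half, out + 3 * quarter)

    copy(0, 0, n, 0)
    return bi


matrix = list(range(16))      # 4 x 4 matrix; entry (i, j) holds 4 * i + j
layout = rm_to_bi(matrix, 4)
```

Each call writes one contiguous range of the output vector, so a stolen subtask can share output blocks with other tasks only at the two ends of its range.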

Let $\tau_K$ denote the kernel of task $\tau$ (the remainder of $\tau$ when stolen subtasks are removed). The
original task $\tau$ or a stolen task $\tau$ each have a cache miss cost of $O(|\tau_K|/B + \sqrt{|\tau|})$, as they are
reading from a size $\sqrt{|\tau|} \times \sqrt{|\tau|}$ submatrix in RM format. Summed over all stolen tasks, plus the
original task, this is $O(n^2/B + \sum_{\text{stolen } \tau} \sqrt{|\tau|})$. The additional block delay caused by $\tau$ being stolen
is $O(B)$, as $\tau$ is writing in left to right order into a vector, the BI format for the matrix, and so
the only blocks on which it has access conflicts are the leftmost and rightmost blocks to which it
writes.

Lemma 4.6. The above algorithm for converting RM to BI format incurs $O(n^2/B + n\sqrt{S})$ cache
misses and a delay of $O(S \cdot B)$ due to its block misses.

Proof. The number of cache misses is maximized if the sizes of the stolen tasks are as large as
possible, namely of size $\Theta(n^2/S)$ or larger. This causes $\sum_{\text{stolen } \tau} \sqrt{|\tau|} = O(n\sqrt{S})$.

The bound on the block delay is immediate. ■ 

Comment. We observe that $n^2/B + n\sqrt{S} = O(n^2/B + S \cdot B)$.

Consequently, the cache miss cost is bounded by the sequential cache miss cost plus the block 
miss delay. 
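One way to see the observation in the Comment is through the arithmetic-geometric mean inequality. The sketch below takes the second term of the cache miss bound to be $n\sqrt{S}$, which is what the proof of Lemma 4.6 yields for $S$ stolen tasks of size $\Theta(n^2/S)$ each.

```latex
n\sqrt{S}
  = \sqrt{\frac{n^{2}}{B} \cdot (S \cdot B)}
  \le \frac{1}{2}\left(\frac{n^{2}}{B} + S \cdot B\right),
\qquad\text{hence}\qquad
\frac{n^{2}}{B} + n\sqrt{S} = O\!\left(\frac{n^{2}}{B} + S \cdot B\right).
```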

However, going from BI to RM is not a symmetric process; the direct logarithmic depth tree
algorithm incurs many block misses because a subtask $\tau$ may write to $\Theta(\sqrt{|\tau|})$ blocks shared with
other tasks. But this computation is being used for matrix multiply, so we can afford an algorithm
with $T_\infty = O(\log^2 n)$. This slower runtime allows us to sharply reduce the number of block misses.

This algorithm divides the length $n^2$ BI representation array into four parts, each of which it
recursively converts to RM order. Then, using a tree computation, it copies the four subarrays into

2 1 2 

one subarray in RM order. For this algorithm, W = 0{n? logn) and Q = logM )• 

In this tree computation, each task reads from two arrays in RM order to produce its output 
in one array again in RM order. 
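A sequential sketch of this divide-and-merge conversion is given below. It is illustrative only: the name `bi_to_rm`, the restriction to $n$ a power of two, and the quadrant order (top-left, top-right, bottom-left, bottom-right) are assumptions, and the merge loop stands in for the tree computation of the parallel algorithm.

```python
import math


def bi_to_rm(bi):
    """Convert a flat BI-layout list of an n x n matrix (n a power of two)
    to row-major (RM) order. The quadrant order top-left, top-right,
    bottom-left, bottom-right is an assumption of this sketch.
    """
    size = len(bi)
    if size == 1:
        return list(bi)
    quarter = size // 4
    # Recursively convert each of the four parts (one quadrant each) to RM.
    tl, tr, bl, br = (bi_to_rm(bi[k * quarter:(k + 1) * quarter]) for k in range(4))
    half = math.isqrt(quarter)          # side length of one quadrant
    rm = []
    # Merge step: each output row is a row of a left quadrant followed by
    # the matching row of the right quadrant, so two RM arrays are read
    # and one RM array is written.
    for left, right in ((tl, tr), (bl, br)):
        for r in range(half):
            rm.extend(left[r * half:(r + 1) * half])
            rm.extend(right[r * half:(r + 1) * half])
    return rm


restored = bi_to_rm([0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15])
```

Because every level of the recursion performs a full merge pass, the work picks up a logarithmic factor, in line with the $\log n$ factor in the bound on $W$.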

Lemma 4.7. The above algorithm for converting BI to RM format incurs $O(\frac{n^2}{B} \log S)$ cache misses
and a delay of $O(S \cdot B)$ due to its block misses.

Proof. For each stolen task $\tau$ with kernel $\tau_K$, there are $\Theta(\lceil |\tau_K|/B \rceil)$ cache misses. The $k$ stolen
subtasks at a given level of recursion, of total size $r \le n^2$, incur $O(k + r/B)$ cache misses. The
number of cache misses is maximized if, at each level, starting at the highest level, each stolen task
is as large as possible and between them they have combined size $n^2$. This yields $O(\frac{n^2}{B} \log S)$ cache
misses.

Again, the cost of the block misses is $O(B)$ per steal. ■

An improved method for BI to RM conversion with $T_\infty = O(\log n)$ is given in [6].

As we will see, the number of steals in the MM algorithms dominates that for the above BI to
RM algorithm, and consequently the MM algorithms dominate the BI to RM algorithm in terms
of operation count, runtime, and combined cache miss and block delay. Thus the just described
conversion algorithms can be used with the MM algorithms without affecting the asymptotic
complexity of the MM algorithms.

5 The Analysis of RWS with False Sharing 

Here we analyze the performance of randomized work-stealing [1, 2] when the cost of block misses
is incorporated. Our analysis follows the approach taken in [1], but without the assumptions, made
in [1], of an $O(1)$ block size and no false sharing. We note that even if $B = O(1)$, when false sharing
is allowed, showing an $O(B)$ ($= O(1)$) bound on the block miss delay in serving any access request
to a block appears to require a non-trivial justification, as presented in Section HI 

We assume that a successful steal takes between $s$ and $a_2 s$ time, for some constant $a_2 > 1$,
and an unsuccessful one takes $O(s)$ time (prior work assumed both took $\Theta(s)$ time). We assume
that $s \ge b$, which seems plausible, for each steal requires reading data on another processor and
consequently a cache miss seems to be unavoidable. Also, henceforth, for simplicity, we assume
that $s$ is an integer multiple of $b$.

As in [1], we bound the number of steals by using a potential function $\phi$, which we now define.
We assign a cost to each node in the execution dag $D$ for a given computation. To this end,
recall that $E$ is an upper bound, measured in cache misses, on the delay due to cache and block
misses occurring in the execution of any one node, and let $e_1$ be an upper bound on the number
of operations (reads, writes and computations) performed in the execution of any one node. By
assumption, $e_1 = O(1)$. Each node is given a cost of $e_1 + bE$. In addition, to cover the cost of
steals, any node performing a fork is given an additional cost of $2s$ (the factor of 2 simplifies the
analysis). (This additional cost incorporates all the delay incurred by the fork including any block
misses that may ensue.) The cost of a path in $D$ is simply the sum of the costs of the nodes on the
path.

The height $h(u)$ of a vertex $u$ in $D$ is $1/s$ times the maximum cost among all the paths descending
from $u$. $h(t)$ denotes the height of the root $t$ of $D$. Note that $h(t) = O(\frac{1}{s}(e_1 + bE + s) T_\infty) = O([\frac{b}{s} E + 1] T_\infty)$,
where $T_\infty$ denotes the length, in vertices, of the longest path in $D$.
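To make the cost and height definitions concrete, here is a small sketch that computes $h(u)$ on an explicit dag. The dictionary representation, the name `height`, and the convention that a path descending from $u$ includes $u$ itself are assumptions made for illustration.

```python
def height(u, children, forks, e1, b, E, s):
    """Return h(u): 1/s times the maximum cost of a path descending from u.

    children maps each node to the list of its successors in the dag D;
    forks is the set of nodes that perform a fork. The path is taken to
    include u itself (an assumption of this sketch).
    """
    memo = {}

    def max_cost(v):
        if v not in memo:
            # Every node costs e1 + b*E; a forking node costs 2*s more.
            own = e1 + b * E + (2 * s if v in forks else 0)
            memo[v] = own + max((max_cost(w) for w in children[v]), default=0)
        return memo[v]

    return max_cost(u) / s


# Hypothetical dag: a fork f, two parallel nodes a and c, and a join j.
dag = {"f": ["a", "c"], "a": ["j"], "c": ["j"], "j": []}
h_root = height("f", dag, forks={"f"}, e1=1, b=2, E=3, s=4)
```

With these hypothetical parameters each node costs 7, the fork costs 15, and the longest path has cost 29, so the root has height 29/4.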

We view each task corresponding to a node as performing up to $e_1 + bE$ "work units" when it is
executed, each work unit corresponding to one unit of time being expended on its execution. This
includes time spent waiting due to cache and block misses.

If the task associated with vertex $u$ is on a task queue, $u$ has an associated potential $\phi(u) = 2^{1 + h(u)}$.
If it is currently being executed by a processor, with $x$ of its work units already having
been performed, $u$ has potential $\phi(u) = 2^{h(u) - x/s}$; otherwise, $u$'s potential is zero. $\phi = \sum_u \phi(u)$.
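The per-node potential can be written as a small function. This sketch reads the two formulas as $2^{1 + h(u)}$ for a queued task and $2^{h(u) - x/s}$ for an executing one, a reading that is consistent with the later statement that a steal halves the potential of the stolen task; the function name and the string-valued `state` argument are illustrative assumptions.

```python
def potential(h_u, state, x=0, s=1):
    """Potential of a node of height h_u.

    state is "queued" (its task is on a task queue) or "executing" (x of
    its work units already performed, with steal cost s); any other state
    gives zero potential.
    """
    if state == "queued":
        return 2.0 ** (1 + h_u)
    if state == "executing":
        return 2.0 ** (h_u - x / s)
    return 0.0


# A queued task that is stolen starts executing, and its potential halves.
before = potential(5, "queued")
after = potential(5, "executing", x=0, s=4)
```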

To show progress, we analyze the algorithm in periods called phases. We identify two types of
phases, steal and computation phases. At the start of a new phase, if at least half the potential
$\phi$ is associated with vertices $u$ whose associated tasks are on task queues, this is a steal phase.
Otherwise, it is a computation phase. A steal phase lasts until $2p$ attempted steals complete,
successfully or not, while a computation phase lasts for $b$ time units. We show that the expected
value of the potential function $\phi$ decreases at least in proportion to the number of successful steals.
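The rule for choosing the phase type can be stated in a few lines. In this sketch the function name `phase_type` and the representation of the potentials as two lists are assumptions made for illustration.

```python
def phase_type(queued_potentials, executing_potentials):
    """Classify the next phase from the potentials at its start.

    It is a steal phase if at least half of the total potential belongs
    to tasks on task queues, and a computation phase otherwise.
    """
    queued = sum(queued_potentials)
    total = queued + sum(executing_potentials)
    return "steal" if 2 * queued >= total else "computation"


kind_a = phase_type([8.0, 4.0], [2.0])    # queues hold 12 of 14
kind_b = phase_type([1.0], [8.0])         # queues hold 1 of 9
```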

Lemma 5.1. In a steal phase, the expected value of (j) reduces to at most | of its starting value. 

Proof. Potential of at least ^ is associated with tasks at the heads of queues, since, on any task 
queue, the heights of successive tasks decrease by a factor of at least 2, and hence at least | of 
the potential associated with tasks on task queues is for tasks at the heads of these queues. Let 
Tu be a task at the head of a task queue. The probability that Tu is not stolen in one attempted 
steal is 1 — 1/p. Hence over the at least 2p attempted steals, it is not stolen with probability 
(1 — 1/p)^^ < 1/e^, and hence is stolen with probability more than |. li is stolen, the potential 
(j){u) decreases by a factor of 2. Consequently, the expected value of (p is reduced to at most 

Corollary 5.1. With probability at least j^, in a steal phase (j) reduces to at most y| of its starting 
value. 

Proof. Otherwise, the expected decrease is less than ^ • 1 + (l — JE ^ ' 
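The probability step in the proof of Lemma 5.1, that a task at the head of a queue survives $2p$ uniformly random steal attempts with probability at most $1/e^2$, can be checked numerically. The function name and the range of $p$ tested are arbitrary choices for this sketch.

```python
import math


def prob_not_stolen(p, attempts=None):
    """Probability that a fixed queue head is not chosen in any of the
    given number of attempts (default 2p), when each attempt misses it
    with probability 1 - 1/p."""
    if attempts is None:
        attempts = 2 * p
    return (1 - 1 / p) ** attempts


# The largest survival probability over a range of processor counts.
worst = max(prob_not_stolen(p) for p in range(2, 2000))
```

The survival probability increases with $p$ towards $1/e^2 \approx 0.135$ but never reaches it, so the head task is stolen with probability above 0.86 for every $p$ in the tested range.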

A computation phase lasts for b time units. 
Lemma 5.2. In a computation phase, $\phi$ reduces to at most $(1 - \frac{b}{4s})$ of its starting value.

Proof. Suppose processor $C$ is currently executing task $\tau_u$ corresponding to vertex $u$. Then in the
current phase, $C$ can do one of three things.

a. It could complete its task with nothing left on its task queue. Then the associated potential is
reduced to zero.

b. It could perform a fork. We show that this reduces the associated potential $\phi(u)$ to at most
$\frac{3}{4}\phi(u)$. When a processor executing task $\tau_u$ forks, it creates tasks $\tau_v$ and $\tau_w$, placing $\tau_v$ on its
task queue. Recall that each forking node is assigned an additional cost of $2s$. Hence, the forked
task $v$ that is placed on the task queue has potential $\phi(u)/2$, and the forked task $w$ that continues
the execution has potential $\phi(u)/4$. Thus, the potential is reduced from $\phi(u)$ to $\phi(v) + \phi(w) \le \phi(u)/4 + \phi(u)/2 = \frac{3}{4}\phi(u)$.

c. If (a) and (b) do not hold, then processor $C$ executes its task throughout the phase without
forking. Hence processor $C$ performs a sequence of at least $b$ work units. This reduces the starting
potential $\phi(u)$ to at most $\phi(u) 2^{-b/s} = \phi(u)(1 + 1)^{-b/s} \le \phi(u)(1 - \frac{b}{2s})$ if $b \le s$. (Note that for
$0 \le x \le 1$, on setting $x' = 1 - x$, we have
$(1 + 1)^{-x} = (1 + 1)^{-1 + x'} = \frac{1}{2}(1 + 1)^{x'} \le \frac{1}{2}[1 + x' - x'(1 - x')/2! + x'(1 - x')(2 - x')/3! + \cdots] \le \frac{1}{2}[1 + x'] = 1 - x/2$.)

Hence in one computation phase the potential is reduced to at most $\frac{1}{2} + \frac{1}{2}(1 - \frac{b}{2s}) = 1 - \frac{b}{4s}$
of its starting value. ■
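Two arithmetic facts used in the proof of Lemma 5.2 can be checked numerically: the fork case leaves at most $1/2 + 1/4$ of the potential, and $2^{-x} \le 1 - x/2$ for $0 \le x \le 1$ (applied with $x = b/s$). The grid resolution and the names below are arbitrary choices for this sketch.

```python
def max_violation(steps=1000):
    """Largest value of 2^(-x) - (1 - x/2) over a uniform grid on [0, 1].

    A value of at most zero means the inequality 2^(-x) <= 1 - x/2 holds
    at every grid point.
    """
    worst = float("-inf")
    for k in range(steps + 1):
        x = k / steps
        worst = max(worst, 2.0 ** (-x) - (1 - x / 2))
    return worst


# Fork case: the queued child keeps at most 1/2 of the potential and the
# continuing child at most 1/4.
fork_ratio = 1 / 2 + 1 / 4
violation = max_violation()
```

The inequality is tight only at the endpoints $x = 0$ and $x = 1$, since $2^{-x}$ is convex and $1 - x/2$ is the chord through those two points.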

Theorem 5.1. For $a = \omega(1)$, with probability $(1 - 2^{-\Theta(a \cdot h(t))})$, the number of successful steals is
bounded by $O(p \cdot h(t)[1 + a])$. In addition, the time spent by all the processors collectively on steals,
successful and unsuccessful, is $O(p \cdot s \cdot h(t)[1 + a])$.

Recall that $h(t) = O([\frac{b}{s} E + 1] T_\infty)$.

Proof. The initial value of $\phi$ is $2^{h(t)}$. While one node remains unexecuted, $\phi \ge 1$. Thus once $\phi$
reduces to 1, the computation completes in a further $O(1)$ time, in which time only $O(p)$ attempted
steals can complete. So to bound the number of attempted steals, it will suffice to consider the
time during which $\phi$ reduces from its initial value to 1.

Say that a steal phase is successful if (j) reduces to at most y| of its value at the start of the 
phase, and that it is unsuccessful otherwise. By Corollary 15.11 a steal phase is successful with 
probability at least ^. 

Suppose that there are $x$ successful steal phases, $y$ unsuccessful ones, and $z$ computation phases,
until all the successful steals complete. Then $x + \frac{b}{s} z = O(h(t))$.

Now each computation phase takes $b$ time units and hence uses $O(pb)$ time over all $p$ processors.
So the $z$ computation phases use $O(pbz)$ time units over all $p$ processors. As a successful steal
takes at least $s$ time units, there can be only $O(pbz/s)$ successful steals that start and finish in a
contiguous sequence of computation phases. Any other successful steal either starts or ends during
a steal phase; there can be at most $2p$ of these per steal phase, $O(p(x + y))$ in total.

Finally, a steal phase is unsuccessful with probability at most y|. By a standard computation 
(which asks what is the probability of fewer than $x$ successful coin tosses in a sequence of $(a + 1)x$
coin tosses), with probability $1 - 2^{-\Theta(ax)}$, there are $O(x \cdot a)$ unsuccessful steal phases, which yields
a total of $O((x(1 + a) + \frac{b}{s} z) p)$ steals with probability $1 - 2^{-\Theta(ax)}$.

To bound the time spent on steals we note that, summed over all the processors, a steal phase
uses time $O(sp)$ for the steals and a computation phase $O(bp)$ time. Thus the time cost of the
steals is $O((x + y)ps + zpb) = O((x(1 + a) + \frac{b}{s} z)ps) = O(h(t)(1 + a)ps)$. ■

In all the algorithms we consider, $E = O(B)$. However, this bound need not hold in general.

6 Analysis of HBP Algorithms 

HBP algorithms are obtained by imposing a further modest restriction on the Hierarchical Tree
algorithms. First, we define Balanced Parallel (BP) algorithms. These are Tree Algorithms, but
with the further restriction that they have roughly equal-sized subproblems; for a tree computation
$T$, where $\tau$ is the corresponding task, there are constants $c_1 < 1 < c_2$ and $\alpha < 1$, such that
subtasks corresponding to the subtrees at the $i$th level all have sizes between $c_1|\tau|\alpha^i$ and $c_2|\tau|\alpha^i$.
Next, we define Type $i$ Hierarchical Balanced Parallel (HBP) Algorithms, which correspond to Type
$i$ Hierarchical Tree Algorithms. Here, we apply a similar restriction to the trees forking recursive
computations, namely that the number of leaves in the subtrees at a given level are all within a
constant factor of each other ([6] also requires the size of the recursive subproblems to be within a
constant factor of each other, but this is not needed for the analysis below).
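The balance condition on BP computations can be expressed as a simple check. In this sketch the list-of-levels representation and the name `is_balanced` are assumptions; the constants `c1`, `c2` and `alpha` play the roles of $c_1$, $c_2$ and $\alpha$ in the definition.

```python
def is_balanced(root_size, level_sizes, c1, c2, alpha):
    """Check the BP size restriction.

    level_sizes[i] lists the sizes of the subtasks at level i (the root
    is level 0). Every size at level i must lie between
    c1 * root_size * alpha**i and c2 * root_size * alpha**i.
    """
    for i, sizes in enumerate(level_sizes):
        low = c1 * root_size * alpha ** i
        high = c2 * root_size * alpha ** i
        if any(not (low <= size <= high) for size in sizes):
            return False
    return True


# Hypothetical examples with alpha = 1/2: an even split is balanced,
# while a 15/1 split at level 1 is not.
even = is_balanced(16, [[16], [8, 8], [4, 4, 4, 4]], c1=0.5, c2=2, alpha=0.5)
skewed = is_balanced(16, [[16], [15, 1]], c1=0.5, c2=2, alpha=0.5)
```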

As it suffices to analyze BP and Type 2 HBP algorithms to also handle the analyses in our 
companion paper, we will limit the analysis below to these classes. (As it happens, this analysis 
allows us to handle the Type 3 algorithm for list ranking and the Type 4 algorithm for connected 
components.) 

Broad-brush, our analysis is similar to that in Section 5. Again, we associate a "height" or level
$h(u)$ with each node in the computation dag $D$, and show that the same resulting potential function
decreases as before. However, we improve the bounds from Section 5 by reducing the over-counting
of delays due to block misses. For example, if two processors are competing to access a block, only
one of them will be delayed on their first access. Our current analysis assumes both are delayed.
As it turns out, this has a substantial effect on our bounds.

We do this by introducing levels for nodes in $D$ that change dynamically as the algorithm
execution proceeds. Each node $u$ in the dag $D$ will have a current level, $h(u)$. $h(u)$ may decrease as
$u$ is executed; it may even change for a node that has not yet been expanded in the dynamic dag.
The challenge in designing the level $h$ is that for an edge $(u, v)$, the difference $h(u) - h(v)$ needs to
be at least the time taken to perform the operations at node $u$, which could include the effect of a
block miss delay. However, we want this difference to be large only if there really is a block delay.
Since the nodes at which block delays occur depend on the order of execution, our analysis needs
to account for every possible order. All of these present challenges in setting up the $h$ function.
In our approach, we enable sufficiently large differences when needed by dynamically reducing the
value of $h(w)$ for a suitable subset of the nodes $w$ that could access block $\beta$ when $\beta$ incurs a block
delay; we also reduce the $h$ value for selected descendants of such nodes $w$.

The effect is to reduce the prior bound of $h(u) = O(B \log n)$ for a size $n$ BP computation to
$O(B + \log n)$; in fact, the tighter bound $h(u) = O(\log n + \min\{n, B\})$ holds.

We begin by analyzing BP computations. The extension to HBP computations is fairly straight- 
forward. 

6.1 BP Algorithm Analysis 

There are two classes of delays we need to consider. The first are due to accesses to global arrays;
by Lemma 4.4, there are $O(B)$ delays per accessed block. BP algorithms have the following feature
which allows them to avoid a cascading series of delays along a path in the computation dag $D$:

Regular Pattern for BP Global Variable Access. All writes to global variables
(typically arrays of size $n$) are performed either at the leaf nodes or in the following
regular pattern: the $i$th node in the down-pass tree in inorder writes locations
$[a(i - 1) + 1 \,..\, a \cdot i]$ for some constant $a > 1$, and similarly for nodes in the up-pass tree.
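The regular pattern assigns each node its own window of $a$ consecutive locations. The sketch below, with the illustrative name `write_locations` and 1-based indexing as in the text, shows that the windows of consecutive nodes in inorder tile the array without overlap.

```python
def write_locations(i, a):
    """Locations written by the i-th node in inorder (1-based):
    a*(i-1)+1 through a*i, inclusive."""
    return range(a * (i - 1) + 1, a * i + 1)


# With a = 3, nodes 1..4 write locations 1..12, each location exactly once.
covered = [loc for i in range(1, 5) for loc in write_locations(i, 3)]
```

Since nodes adjacent in inorder write adjacent windows, a block of $B$ locations is written only by a run of consecutive nodes.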

Prefix-sums can be implemented as a sequence of two BP computations with a regular pattern. 

Together with the balanced subtree requirement, this feature ensures that the only conflicts
in accessing global arrays occur at nodes in the bottom $a_1 \log B$ levels of the up-pass tree, for a
suitable constant $a_1 > 1$, and the same is true for nodes in the down-pass tree. We call the height
$a_1 \log B$ subtrees formed by these nodes conflict subtrees. The key property is that between them,
the nodes in each pair of complementary conflict subtrees (one in the down-pass tree, one in the
up-pass tree) access the same $O(1)$ blocks, and thus the cumulative delay due to accesses to the
global arrays is $O(B)$ (measured in units of cache misses).

The second class of accesses are to variables stored on the execution stack (including hidden 
variables such as those for reporting the completion of a subtask). We limit ourselves to BP 
algorithms which have the following additional feature (stemming from a natural scoping of variables 
and the use of return value variables); this is used in bounding the cost of these accesses. 

BP Local Variable Access. Writes to local variables by the task for a node $v$ are to
$v$'s local variables, and in the up-pass possibly to $u$'s local variables also, where $u$ is $v$'s
parent (the node following $v$ in the up-pass computation).

Recall that we let $e_1 = O(1)$ be a bound on the number of reads and writes performed at any
one node in $D$. We also let $e_2$ be the number of times a writable variable can be accessed during
the computation. Recall that, by the limited access property, $e_2 = O(1)$ also. Finally, we let
$e = \max\{e_1, e_2\}$.

We also assume that the blocks storing the execution stack for $C$ and those storing $C$'s task
queue are disjoint.

Next, we define the current levels for the nodes in $D$. We specify four levels, $\ell_1, \ell_2, \ell_3, \ell_4$, where
$\ell_1$ accounts for the effects of steals, $\ell_2$ for the effects of global variables, $\ell_3$ for the effects of local
variables, and $\ell_4$ for the interactions between local and global variables. The current level $h(u)$ of
a node $u$ is given by $h(u) = \ell_1(u) + \frac{b}{s}[\ell_2(u) + \ell_3(u) + \ell_4(u)]$.

A key property that we enforce on the levels is that for each edge $(u, v)$ in $D$, $\ell_i(u) \ge \ell_i(v)$
for $i = 2, 3, 4$, and $\ell_1(u) \ge \ell_1(v) + 2$. Also, $\ell_i(u) \ge 0$ for $1 \le i \le 4$. These properties will allow
us to show essentially the same reduction in potential during a computation phase as was shown
previously in Lemma 5.2 (see Lemma 6.8 below).

$\ell_1(u)$ is the simplest so we define it first. Let $\text{ht}(u)$ be the actual height of $u$ in the dag
$D$, measured in edges. Then $\ell_1(u) = 2\,\text{ht}(u) \ge 0$. $\ell_1(u)$ remains unchanged throughout the
computation. Note that if $(u, v)$ is an edge in $D$, then $\ell_1(u) \ge \ell_1(v) + 2$.
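The definition of the first level, twice the height of a node measured in edges, and the resulting edge property can be checked on a small example. The dictionary representation of the dag and the name `ell_1` are assumptions of this sketch.

```python
def ell_1(children):
    """Return {u: 2 * ht(u)}, where ht(u) is the number of edges on the
    longest path from u to a sink of the dag."""
    ht = {}

    def height_in_edges(u):
        if u not in ht:
            ht[u] = max((1 + height_in_edges(v) for v in children[u]), default=0)
        return ht[u]

    return {u: 2 * height_in_edges(u) for u in children}


# Hypothetical dag: a fork f, two parallel nodes a and c, and a join j.
dag = {"f": ["a", "c"], "a": ["j"], "c": ["j"], "j": []}
levels = ell_1(dag)
# Every edge (u, v) should satisfy levels[u] >= levels[v] + 2.
edge_property = all(levels[u] >= levels[v] + 2 for u in dag for v in dag[u])
```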

$\ell_2$ definition and analysis. To specify $\ell_2$ we need to define the following height $O(\log B)$ conflict
subtrees $T$ at the bottom (leaf level end) of the up-pass tree. Let depth $d + 1$ be the greatest depth
such that all the subtrees at depth $d + 1$ have $B - 1$ or more nodes. Then the nodes at depth $d$ are
the roots of the conflict subtrees. We define analogous conflict subtrees in the down-pass tree. We
pair complementary conflict subtrees, namely those comprising paired fork-join nodes. By the BP
Global Variable Access property, any access by the root of a conflict subtree $T$ to a global array
can have a conflict only with other nodes in $T$ and its paired tree $T'$, and thus in fact there are
no conflicts as a result of these accesses: for in the case of the up-pass tree $T$, the nodes below its
root will have completed before the root starts its computation, and in the case of the down-pass
tree $T'$, the nodes below the root will start their computation only after the root completes. This
conflict free property also applies to nodes nearer the root of the up-pass and down-pass trees.

Lemma 6.1. A conflict subtree has at most $4\frac{c_2}{c_1}(B - 2) + 3$ nodes.

Proof. Necessarily, there is a subtree with root at depth $d + 2$ with $B - 2$ or fewer nodes. Thus the
largest subtree at this depth has at most $\frac{c_2}{c_1}(B - 2)$ nodes. Hence, every conflict tree has at most
$4\frac{c_2}{c_1}(B - 2) + 3$ nodes. ■

Let $\beta$ be a block used to store parts of the global array(s). For each $\beta$ and each conflict subtree
$T$ some of whose nodes access $\beta$, every node $v$ in $T$ and in its paired subtree $T'$ has $eB$ added to
the initial value of $\ell_2(v)$ (from a starting value of 0). For short, we say that $T$ can touch $\beta$.

Whenever there is an access to block $\beta$, for each conflict subtree $T$ with a node that can write
to $\beta$, $\ell_2(v)$ is decremented by one for every node $v$ in $T$ and its paired tree $T'$. All decrements
will be by 1, so we will not mention this henceforth. Note that while the access is by some node $u$ in
some conflict tree $T$, decrements may also occur to $\ell_2(u')$ for the nodes $u'$ in a neighboring conflict
subtree $T''$. The net effect is that for each conflict subtree $T$, $\ell_2(v)$ is identical for every node $v$ in
$T$ and the paired $T'$.

The remaining nodes in the up-pass tree, those not in conflict subtrees, are given an $\ell_2$ value
of 0.

Lemma 6.2. For v in the up-pass tree, initially l2{v) < ^^e^S, and always i2{v) > 0. 

Proof. A global array to which nodes write e' items has at most • [4^(5 — 2) + 3]] < 4e'^ 
blocks that can be touched by conflict subtree T. Recall that each node performs at most e accesses. 
Summing the 4e'^ bound over all e', one per global array, yields a bound of 4e^ blocks (since the 
sum of the e' is at most e) . Each such block contributes eB to the initial value of £2 {v) for v in T, 
yielding the initial bound of Ae^e^B on l2{v). 

To obtain the second bound, we note that for each block /3 that T can touch, by the limited 
access property, there are at most eB decrements to £2{v) for v in T. As there was an initial 
contribution of eB to £2(1') for /S, it follows that £2{v) remains non-negative forever. ■ 

In the down-pass tree, for each node u outside of any conflict subtree, £2{u) = Ae^^B. It follows 
that on any path in the dag $D$, $\ell_2$ is non-increasing.
We have shown:

Lemma 6.3. For all nodes v in the dag D, initially £2{v) < A^e^B, and always £2{v) > 0. Further, 
for every edge $(u,v)$ in $D$, $\ell_2(u) \ge \ell_2(v)$.

$\ell_3$ definition and analysis. Recall that $\ell_3$ is intended to handle accesses to local variables and
these are restricted as specified in the BP Local Variable Access Property.

We will consider a task $\tau$, which is either the initial task in the computation or a stolen subtask.
We will be concerned with analyzing the delays in accessing local variables on $S_\tau$, $\tau$'s execution
stack. Recall that the kernel of $\tau$ is the portion of $\tau$ that does not get stolen. Also recall the
definition of a segment for a node $v$ in the computation dag $D$: it stores the variables declared
by $v$, if $v$ is a fork or leaf node. If $v'$ is a join node, its segment is the segment created by the
corresponding fork node $v$. The segments on $S_\tau$ correspond to the sequence of fork nodes in the
kernel whose corresponding join tasks have not yet been completed, plus possibly one segment at
the top of the stack for a leaf node, if this is the node in $\tau$'s kernel being executed currently.

We focus on the path $P_\tau$ in $D$. Recall that this is the path of nodes starting at the root node
(i.e. the initial node) for $\tau$'s computation, and for which the off-path right children correspond to
the roots of $\tau$'s stolen subtasks. This is a path in the down-pass tree. We define a corresponding
path $P'_\tau$ in the up-pass tree: if $v$ is a node on $P_\tau$, necessarily a fork node, then the corresponding
join node $v'$ is on $P'_\tau$. It is convenient to extend $P_\tau$ (and $P'_\tau$) so as to connect these two paths, as
follows. Let $w$ be the root node for the last stolen subtask of $\tau$, and let $v$ be $w$'s sibling. Then $v$
plus the path of right children descending from $v$ to a leaf of the down-pass tree forms the extension
of $P_\tau$, and the corresponding nodes in the up-pass tree form the extension of $P'_\tau$. This leaf node is
common to both paths.

We still need to identify the nodes on $P_\tau$ and $P'_\tau$ and their off-path children more precisely. See
Figure 1. Let $(v_1, v_2, \ldots, v_k)$ be the path of nodes forming $P_\tau$, descending the down-pass tree, and
including a leaf. Let $(v_{r_1}, v_{r_2}, \ldots, v_{r_t})$ be the subsequence of these nodes at which a steal occurs;
i.e. their right children are the root nodes for stolen subtasks of $\tau$, and let $(w_{r_1+1}, w_{r_2+1}, \ldots, w_{r_t+1})$
be this sequence of right children. Let $v'_i$ be the node on $P'_\tau$ corresponding to $v_i$; i.e. $P'_\tau$ comprises
the path $(v'_k, \ldots, v'_1)$ of nodes ascending the up-pass tree. Let $w'_{r_s+1}$ be the off-path child of
$v'_{r_s}$ in the up-pass tree, $1 \le s \le t$; note that $w'_{r_s+1}$ corresponds to node $w_{r_s+1}$ in the down-pass
tree. Sometimes we will call $w_{r_s+1}$ the non-kernel child of $v_{r_s}$, and similarly for $w'_{r_s+1}$ w.r.t. $v'_{r_s}$.

Let $\beta$ be a block storing a portion of $S_\tau$. By the BP Local Variable Access Property, the only
nodes that can access $\beta$ are nodes in $\tau$'s kernel and nodes $w'_{r_s+1}$, $1 \le s \le t$. We define $\ell_3$ so as
to ensure a sufficient reduction in the overall potential when such accesses occur. As it happens,
some of these accesses will not cause a reduction to the $\ell_3$ values, but as we will see in the proof
of Lemma 6.8, the potential associated with the nodes involved in such non-reducing accesses is
small, and thus much of the potential is for nodes whose potential does drop, ensuring a sufficient
overall drop in potential during an execution phase.

Now, we are ready to define $\ell_3$ precisely.

Initial values of $\ell_3$.

For a node $v'$ in the up-pass tree, initially,

$\ell_3(v') = 2e \cdot$ (length in vertices of the path from $v'$ to the root $t$ of the up-pass tree).

So $\ell_3(t) = 2e$ initially (recall that $t$ denotes the root of the up-pass tree).

Let $\ell_3(\gamma)$ be the maximum initial value of $\ell_3$ over all leaves in the up-pass tree (which are also
the leaves in the down-pass tree).

For non-leaf nodes $v$ in the down-pass tree,

$\ell_3(v) = \ell_3(\gamma) + e \cdot$ (height of the down-pass tree) $+\ e \cdot$ (height of $v$ in the down-pass tree $- 1$).
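As a small worked instance of these initial values, the sketch below evaluates them for complete binary down-pass and up-pass trees. It rests on assumptions: the function name `initial_l3` is hypothetical, the trees are taken to be complete so that values depend only on depth, and the down-pass formula is read as the maximum leaf value plus $e$ times the tree height plus $e$ times (height of the node minus 1).

```python
def initial_l3(height, e):
    """Initial l3 values, indexed by depth, for complete binary down-pass and
    up-pass trees of the given height (root at depth 0, leaves at depth `height`)."""
    # Up-pass tree: 2e times the number of vertices on the path to the root.
    up = {d: 2 * e * (d + 1) for d in range(height + 1)}
    leaf_max = up[height]           # maximum initial value over all leaves
    # Down-pass tree, non-leaf nodes: a node at depth d has height (height - d).
    down = {d: leaf_max + e * height + e * ((height - d) - 1)
            for d in range(height)}
    return up, down


up, down = initial_l3(height=2, e=1)
```

In this reading, a down-pass parent starts at least $e$ above each non-leaf child, and the parents of leaves start $e \cdot$ (tree height) above the largest leaf value, which is the slack the monotonicity lemmas for $\ell_3$ rely on.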

Update specification. We now describe how $\ell_3$ is updated during the computation. In contrast
to the other $\ell_i$, these updates do not always pay for the cost of the current block miss. In particular,
block wait costs at some of the affected $w'_j$ may not be charged under these updates, and in one
case we do not decrease $\ell_3$ at any node even though block wait costs can be incurred at some nodes.
In the proof of Lemma 6.8 we set up a charging scheme which we use to establish that all of these
incurred costs are covered by our update mechanism for $\ell_3$.

Figure 1: Notation for Down Pass and Up Pass Trees.

The down-pass for $\tau$ is defined to end when only one leaf in $\tau$'s kernel remains to be executed;
this is the leaf on $P_\tau$. During the down-pass, $\ell_3$ is updated as follows.

— If there is a node on the task queue for the processor executing the kernel of $\tau$, then no
decrement to the $\ell_3$ values occurs for accesses to $S_\tau$.

— Otherwise (i.e., the task queue is empty), for each block $\beta$ storing a portion of $S_\tau$, if an access
to $\beta$ by $w'_{r_s+1}$ completes while other nodes accessing $\beta$ undergo block misses, $\ell_3(v)$ is decremented
for all non-completed, non-leaf proper descendants $v$ of $v_{r_s}$ in the down-pass tree.

— The remaining possibility is that an access by $v_i$ to $\beta$ completes, where $v_i$ is the topmost
unexecuted node on $P_\tau$; then only $\ell_3(v_i)$ is decremented.

By the time the down-pass of $\tau$ completes, the only nodes remaining to be executed in the
kernel of $\tau$ are the nodes on $P'_\tau$. Portions of stolen subtasks of $\tau$ may also still be under execution.

Note that in the up-pass, $v'_i$ is executed only after all $v'_j$, $j > i$, have completed. We now specify
updates to $\ell_3$ during the up-pass of $\tau$.

— Consider a vertex $v'_i$, for some $i$, $1 \le i \le k$. When an access to $\beta$ by $v'_i$ completes, $\ell_3(v'_i)$
is decremented, as is $\ell_3(w'_i)$ if $w'_i$ is not completed and it is the non-kernel child of $v'_{i-1}$ (i.e., the
stolen sibling of $v'_i$). This occurs if $v'_{i-1}$ is a node at which a join with a stolen subtask of $\tau$ occurs.

— When an access to block $\beta$ by $w'_{r_s+1}$ completes, for some $s$, $1 \le s \le t$, the following
decrements occur: to $\ell_3(v'_j)$ for $j \ge r_s + 1$, and to $\ell_3(w'_{r_{s'}+1})$ for $s' > s$, for non-completed $v'_j$ and
$w'_{r_{s'}+1}$.

Lemma 6.4. Let $u'$ be the parent of $v'$ in the up-pass tree. Then $\ell_3(v') \ge \ell_3(u') \ge 0$ always.

Proof. If $u'$ and $v'$ are both in $\tau$'s kernel, and $v'$ is not on $P'_\tau$, then $v'$ completes before the up-pass
begins, i.e. before $\ell_3(u')$ and $\ell_3(v')$ change from their initial values; so in this case, $\ell_3(v') \ge \ell_3(u')$
and $\ell_3(v') \ge 0$ always.

It remains to consider nodes on $P'_\tau$ and their non-kernel children. Suppose that $u' = v'_{i-1}$
for some $1 < i \le k$. Note that there are at most $2e$ decrements to $\ell_3(v'_i)$, and if $w'_i$ is a
non-kernel node, to $\ell_3(w'_i)$, for which there are not simultaneous decrements to $\ell_3(v'_{i-1})$. As a result,
$\ell_3(v'_{i-1}) \le \ell_3(v'_i), \ell_3(w'_i)$ always.

It follows that $\ell_3(u') \le \ell_3(v')$ always, for all parent-child pairs in the up-pass tree.

We turn to the second part of the claim. Recall that $t$ denotes the root of the dag. $\ell_3(t)$ is
decremented only when $t$ succeeds in an access; i.e. at most $e$ times. So $\ell_3(t) \ge e > 0$ always. The
property that $\ell_3(v') \ge 0$ follows from the first part by induction on the depth of $v'$ in the up-pass
tree. ■

Lemma 6.5. Let $u$ be the parent of $v$ in the down-pass tree. Then $\ell_3(u) \ge \ell_3(v) \ge 0$ always.

Proof. For a node $v$ at depth $d$ in the down-pass tree, there are at most $e(d + 1)$ decrements of
$\ell_3(v)$ (note that we define the root of the down-pass tree to have depth 0). More specifically, there
are $e$ decrements at node $v$ and $e$ decrements due to each $w'_{r_s+1}$ for which $v_{r_s}$ is a proper ancestor
of $v$.

Thus if $u$ is the parent of leaves in the down-pass tree, then $\ell_3(u)$ undergoes at most $e \cdot$
(height of the down-pass tree) decrements; consequently, $\ell_3(u) \ge \ell_3(v)$, for $v$ a child of $u$.

And for a node $u$ that has a non-leaf child, $\ell_3(u)$ can undergo at most $e$ decrements that
its children do not face, hence to ensure that $\ell_3(u) \ge \ell_3(v)$, for $v$ a child of $u$, it suffices that
$\ell_3(u) \ge \ell_3(v) + e$ initially.

For a leaf node $v$, $\ell_3(v) \ge 0$ follows by Lemma 6.4. For other nodes in the down-pass tree,
$\ell_3(v) \ge 0$ follows by induction on the height of $v$ in the down-pass tree. ■

It follows from Lemmas 6.4 and 6.5 that:

Lemma 6.6. $\ell_3$ only decreases on descending the dag $D$. Further, $\ell_3(u) = O(\log n)$ for any node
$u$ in an $n$-leaf dag $D$, and $\ell_3(u) \ge 0$ always.

Next, we bound the numbers of successful and unsuccessful steals by means of a potential
argument, using the same potential function as in Section 5. If $u$ is on a task queue, $u$ has potential
$\phi(u) = 2^{1+h(u)}$; if $u$ is currently being executed by a processor, with $x$ of its work units already
having been performed, $u$ has potential $\phi(u) = 2^{h(u) - x/s}$; otherwise, $u$'s potential is zero. $\phi =
\sum_u \phi(u)$. However, we redefine the notion of work units. Here, we allocate $b$ work units for each
read or write, which allows for a successful access to the relevant variable, but in the case of an
access delayed due to a block miss, the potential will reduce only through changes to $h(u)$.

We define steal and computation phases as before, except that a computation phase will now
last for $2b$ steps; this ensures that in a computation phase, a processor $C$ with enough work will
either complete $b$ steps of work, or there is a successful access to a block $\beta$ by a processor competing
with $C$ for access. The same proof as in Lemma 5.1 shows:

Lemma 6.7. In a steal phase, the expected value of (j) reduces to at most | of its starting value. 
Lemma 6.8. In a computation phase, $\phi$ reduces to at most $(1 - \Theta(b/s))$ of its starting value.

Proof. For each node u being executed there are two possible outcomes during the phase. 

1. $u$'s execution completes, ending with:

a. A fork. 

Suppose that this creates nodes $v$ and $w$ with $v$ being executed and $w$ on a task queue. By
the definition of $\ell_1$, $\ell_1(u) \ge \ell_1(w) + 2, \ell_1(v) + 2$, and hence $h(u) \ge h(w) + 2, h(v) + 2$, and
$\phi(u) \ge 2\phi(w), 4\phi(v)$. Therefore, $\phi(v) + \phi(w) \le \frac{3}{4}\phi(u)$, at least a $(1 - \Theta(b/s))$ reduction.

b. A join. 

Suppose that it activates node $v$. Again, $\ell_1(u) \ge \ell_1(v) + 2$ and so $\phi(v) \le \frac{1}{4}\phi(u)$, at least a
$(1 - \Theta(b/s))$ reduction.

c. u simply ends (because its sibling node is not finished and so a join cannot occur yet). 

Here the potential reduces from $\phi(u)$ to 0, certainly at least a $(1 - \Theta(b/s))$ reduction.

2. u runs for 2b time units and completes at least b work units. 
Then $\phi(u)$ decreases by a factor of at least $2^{b/s}$.

3. $u$ runs for $2b$ time units and incurs a block miss delay due to some other read or write succeeding
on a block $\beta$.

If $\beta$ is storing part of a global array, then either $\ell_2(u)$ or $\ell_4(u)$ decreases, causing $\phi(u)$ to decrease
by a factor of at least $2^{b/s}$. If $\beta$ is storing only local variables, then we need to consider what happens to
$\ell_3(u)$. In this case, suppose that $\beta$ is storing a portion of execution stack $S_\tau$. $u$ need not be a part
of $\tau$.

If $\ell_3(u)$ decreases, then $h(u)$ decreases by $b/s$, and hence $\phi(u)$ decreases by a factor of $2^{b/s}$.

However, $\ell_3(u)$ may not decrease. This leads to three more cases, which we consider below, and
which are the most nontrivial part of the analysis.

a. $\tau$ is currently executing its up-pass.

Let $i$ be the least $j$ such that $\ell_3(v'_j)$ is decremented. Then the only nodes that may incur
a block miss without a corresponding decrement of their $\ell_3$ value are $w'_{r_s+1}$ for $r_s + 1 < i$.
Now ct>{v[) + Er.+Ki 4>{w'r^+i) < + 1 + TE + ■■■]< iHv'i)- Thus the decrement to 

H^i) + Er,+i<i ^(^r.+i) is by a factor of at least f • |. 

b. $\tau$ is currently executing its down-pass and some $\ell_3(v_i)$ is decremented.

The only nodes that may incur a block miss without a corresponding decrement of their $\ell_3$

value are rs + l<i. Now (/>K) + Er,+i<i '^(^r.+i) ^ 4>{vi)[l + ^ + jq ^ ] < 

Thus the decrement to 4>{vi) + J2rs+i<i '^i'^Ks+i) ^y a factor of at least | • |. 

c. $\tau$ is currently executing its down-pass but no $\ell_3$ value is decremented for nodes accessing $\beta$.

We will show below that the total potential of nodes covered by this case is ^(p, where (f) is 
the total potential at the start of the current computation phase. This will suffice, for 

then it follows that the nodes covered by the other cases in (1) and (2) have combined potential 
at least — ^)(t> = ^(f) at the start of the phase, and hence the overall potential reduction is 
by a factor of at least g • f 5 ■ 

Consider the nodes accessing f3. As there is no reduction to the ^3 values of nodes accessing /3, 
there is a node Vi on P^- which is on the task queue. By the Local Variable Access Property, 
the only nodes which can be accessing P are a node v in the kernel of r and nodes u^^s+i ^^^^ 
rs -|- 1 < i (for nodes tfr^+i with -|- 1 > i have not yet been initiated and so nor have nodes 
w' with r' + 1 > i). Note that v is either the sibling of Vi or a descendant of that sibling. 
Since Vi is on the task queue, it follows that ^(fj) > 2(^(w). Further, since £3 reduces by 2 
in each successive level of the up-tree, we have (f){v) + X^^s+ki ^^(^rs+i) — l^(^) ~ 
Summing over all such v yields a total of at most ^cf) potential associated with nodes incurring 
block misses on blocks /3 with no ensuing change to their £3 values. 

4. There might appear to be one more case to consider, namely when $\tau$ is a stolen task which is
computing at the root of its up-pass tree, and it writes into the execution stack of its parent $\tau'$.
But in this case, $u$ will be a $w'$ node for $\tau'$, and this is already handled by the analysis of parts
3a-3c above for the parent task $\tau'$. ■

$\ell_4$ definition and analysis. Conceivably, there is one block storing a part of a global array which
has not yet been accounted for. This is the block, if any, that is being shared between a global
array and the execution stack for the task initiating the BP computation. (While this may seem
contrived in the context of a single BP computation, this is a definite possibility when the BP
computation is a subroutine of a larger HBP computation; the local variables declared in the HBP
computation are global variables from the perspective of the BP computation.)
As there is just one such block, $\beta$, we account for it by initializing $\ell_4(u)$ to $eB$ for every node
in $D$. Whenever an access to $\beta$ causes a block miss, $\ell_4$ is decremented for every
node in $D$. For specificity, we define the block miss due to a core $C$'s successful writes to occur on
the last write by $C$ prior to the block's transfer to another core (we make this definition to avoid
the possibility of distinct accesses by core $C$ being deemed to cause the block misses by cores $C_1$,
$C_2, \ldots$, when these misses are all due to a single uninterrupted sequence of accesses by $C$). Thus
$\ell_4(u)$ is the same for every node $u$ in $D$, and as this decrement occurs at most $eB$ times, $\ell_4(u) \ge 0$
always. The following lemma is immediate.

Lemma 6.9. For all nodes $v$ in the dag $D$, initially $\ell_4(v) = eB$, and always $\ell_4(v) \ge 0$. Further,
for every edge $(u,v)$, $\ell_4(u) = \ell_4(v)$.

The next theorem follows by the same proof as for Theorem 5.1.

Theorem 6.1. For $a = \omega(1)$, with probability $1 - 2^{-\Theta(a \cdot h(t))}$, the number of successful steals is
bounded by $O(p[1 + a]h(t))$, where $h(t)$ denotes the height of the root $t$ of $D$. In addition, the time
spent by all the processors collectively on steals, successful and unsuccessful, is $O(p \cdot s[1 + a]h(t))$.

Recah that h{t) = f [^sW + ^2(i)] + ^i(i) = 0(^logn + fS), using Lemmas EJ and ESI This 
contrasts with the bound for h{t) of 0{[1 + |]i31ogn) used in the earlier Theorem 15.11 (on taking 
E = 0{B) as given by Lemma l4.4p . 

6.2 HBP computations 

The rules restricting the writes in BP computations apply equally to the down-pass and up-pass
trees used to instantiate recursive calls in HBP algorithms (see Section 4 for the specification of Type
2 (and Type $i > 2$) Hierarchical Tree Algorithms and hence of HBP algorithms). The subgraphs
corresponding to the recursive computations are analogous to the leaves of a BP computation. This
permits us to perform an analysis of the HBP computations which is similar to that for the BP
computations.

To enable such an analysis, we require that the writes by the recursive computations to the 
local variables (arrays) of their calling procedures obey an analog of the Regular Pattern for BP 
Global Variable Access, namely that the left-to-right sequence of recursive computations write to 
successive disjoint portions of the parent's array(s). Further, we require that within each recursive 
procedure, its writes to these arrays be similarly constrained. A simple way of ensuring this is to 
impose the following constraint: 

A recursive call performs its writes to such arrays by means of a BP computation that 
occurs at the end of the recursive call. 

The collection of these BP computations terminating the recursive calls obeys the Regular
Pattern for Global Variable Access by a BP Collection, namely the $i$th node in
inorder in the down-pass tree collection writes locations $[a(i - 1) + 1 \,..\, a \cdot i]$ for some
constant $a \ge 1$, and similarly for nodes in the up-pass tree collection. (An inorder
traversal of an ordered collection of trees traverses each tree in inorder in turn, in the
order given by the collection.)

An HBP algorithm can always be modified to have this structure, by accumulating such writes in 
an array local to the recursive call which is then copied at the end of the recursive call. 
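The regular write pattern can be made concrete with a short sketch. The helper names `write_range` and `blocks_shared`, the 1-based location numbering, and the alignment of blocks to multiples of `B` are assumptions made here for illustration; the sketch only shows which inorder nodes end up writing into the same block.

```python
def write_range(i, a):
    """1-based inclusive range of locations written by the i-th node in inorder."""
    return (a * (i - 1) + 1, a * i)


def blocks_shared(num_nodes, a, B):
    """Map each block index to the set of nodes writing into it.
    Blocks are assumed to hold B consecutive locations, starting at location 1."""
    owners = {}
    for i in range(1, num_nodes + 1):
        lo, hi = write_range(i, a)
        for loc in range(lo, hi + 1):
            owners.setdefault((loc - 1) // B, set()).add(i)
    return owners


ranges = [write_range(i, a=2) for i in range(1, 5)]
owners = blocks_shared(num_nodes=4, a=2, B=4)
```

With `a = 2` and `B = 4`, consecutive pairs of nodes share a block, so only neighbors in the inorder sequence can contend for it.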
We use the $\ell_i$ functions, modulo some small changes, as for the BP computations. In fact, the
definition of $\ell_1$ is unchanged: $\ell_1(u)$ is 2 times the height of $u$ in the dag $D$. For $i \ge 2$, in order to
distinguish the definitions of $\ell_i$ for the BP algorithms and for the HBP algorithms, henceforth we
use the notation $\ell_i^{BP}$ to refer to the previously defined $\ell_i$ functions, and reserve the notation $\ell_i$ to
refer to the functions used to analyze the HBP algorithms.

One way of viewing an HBP computation is to regard its computation dag as comprising a
collection of down-pass and up-pass trees joined together either by shared leaves (in the case of
the two trees forming a BP computation) or by connecting edges (in the case of pairs of sequenced
HBP computations; see the rules for constructing $D$ at the start of Section 2). Note that the
two BP computations forming the prefix sum algorithm provide a simple example of a sequencing.
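For readers who want a concrete picture of a tree-structured prefix sum, here is a minimal sequential sketch of the standard two-pass formulation: an up-pass that computes subtree sums and a down-pass that distributes offsets. It is an assumption that this matches the structure the text has in mind; the function name `prefix_sums` and the dictionary of subtree sums are illustrative, and the two recursive calls at each node stand in for a fork that a work-stealing scheduler could run in parallel.

```python
def prefix_sums(xs):
    """Inclusive prefix sums via an up-pass (subtree sums) and a down-pass (offsets)."""
    n = len(xs)
    if n == 0:
        return []
    sums = {}                       # (lo, hi) -> sum of xs[lo:hi]

    def up(lo, hi):
        if hi - lo == 1:
            total = xs[lo]
        else:
            mid = (lo + hi) // 2
            total = up(lo, mid) + up(mid, hi)   # a fork in a parallel version
        sums[(lo, hi)] = total
        return total

    out = [0] * n

    def down(lo, hi, offset):
        # `offset` is the sum of all elements to the left of xs[lo].
        if hi - lo == 1:
            out[lo] = offset + xs[lo]
        else:
            mid = (lo + hi) // 2
            down(lo, mid, offset)
            down(mid, hi, offset + sums[(lo, mid)])

    up(0, n)
    down(0, n, 0)
    return out


result = prefix_sums([3, 1, 4, 1, 5])
```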

Our goal is to use essentially the same $\ell_i$ functions as for the BP computations for each
down-pass and up-pass tree in an HBP computation, for $i \ge 2$. But, in addition, we continue to want that
$\ell_i(u) \ge \ell_i(v)$ for every edge $(u, v) \in D$. To this end, we define the $\ell_i$ values in an HBP computation
as follows.

Let $T'$ be a down-pass tree and let $s'$ denote its root. For $i = 2$ and 3, $s'$ receives $i$-increment
equal to the initial value of $\ell_i^{BP}(s')$. For all other nodes, the $i$-increment is set to 0.

For $i = 2$ and 3, for each node $x$ in the dag $D$ for the HBP computation, we define $\Delta_i(x)$ to be
the maximum, over all paths from $x$ to the bottommost vertex $t$, of the sum of $i$-increments along
the path, excluding the increment at $x$, if any.

For $i = 2$ and 3, for node $x$ in $D$, we then define $\ell_i(x)$ to be $\ell_i^{BP}(x) + \Delta_i(x)$. $\ell_2(x)$ and $\ell_3(x)$
are updated using the same rules as for BP computations, by decrementing $\ell_i^{BP}(x)$.
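The quantity defined above as a maximum over paths can be computed by a simple memoized recursion over the dag. The sketch below is illustrative only: the function name `make_delta`, the four-node dag, and all numeric values are made up, and the dag is assumed to be given as a successor map.

```python
from functools import lru_cache


def make_delta(successors, increment):
    """Return delta(x): the maximum, over paths from x to a sink, of the sum of
    increments on the path, excluding the increment at x itself."""
    @lru_cache(maxsize=None)
    def delta(x):
        succ = successors.get(x, ())
        if not succ:
            return 0
        return max(increment.get(y, 0) + delta(y) for y in succ)
    return delta


# Hypothetical dag: down-pass root r1 -> x -> (sequencing edge) r2 -> t.
successors = {"r1": ("x",), "x": ("r2",), "r2": ("t",)}
increment = {"r1": 5, "r2": 7}              # made-up initial BP values at the two roots
delta = make_delta(successors, increment)

l_bp = {"r1": 5, "x": 3, "r2": 7, "t": 0}   # made-up BP level values
level = {v: l_bp[v] + delta(v) for v in l_bp}
```

In this toy instance the sequencing edge from `x` to `r2` is exactly where the added increment keeps the level from rising along the edge.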

Lemma 6.10. For each edge $(u,v)$ in the dag $D$, $\ell_i(u) \ge \ell_i(v) \ge 0$ always, for $1 \le i \le 3$.

Proof. The claim is immediate for $\ell_1$ as its value never changes. For $\ell_2$ and $\ell_3$, the argument is
identical. We show it for $\ell_2$.

If $(u,v)$ is a tree edge, then the initial value of $\ell_2(u)$ is $\Delta_2(u) + \ell_2^{BP}(u) = \Delta_2(v) + \ell_2^{BP}(u)$.
Changes to $\ell_2(u)$ and $\ell_2(v)$ maintain the property that $\ell_2^{BP}(u) \ge \ell_2^{BP}(v)$ and hence that $\ell_2(u) =
\Delta_2(v) + \ell_2^{BP}(u) \ge \Delta_2(v) + \ell_2^{BP}(v) = \ell_2(v)$.

For a sequencing edge, i.e. a non-tree edge, $\Delta_2(u)$ is at least $\Delta_2(v)$ plus the initial value of $\ell_2^{BP}(v)$,
for $\Delta_2(v)$ does not include node $v$'s 2-increment, namely the initial value of $\ell_2^{BP}(v)$, while $\Delta_2(u)$
does include this 2-increment. As $\ell_2^{BP}(u) \ge 0$ always, $\ell_2(u) \ge \Delta_2(u) \ge \Delta_2(v) + \ell_2^{BP}(v) = \ell_2(v)$
always. ■



$\ell_4$ definition and analysis. There is one substantial change in the definition of the $\ell_i$ values for
HBP computations compared to BP computations. It concerns a block $\beta$ that is shared between
the segment $\sigma_\tau$ and segments for descendant nodes in $D$. The difficulty we face concerns parts
of one or more arrays stored in $\sigma_\tau$ which are being used to store the results from recursive calls
made by $\tau$. These are analogous to the global arrays in BP algorithms. In a BP computation,
block misses on $\beta$ involving these arrays were accounted for by the level function. Now, by
Lemma 4.4, the accesses to this block could face a block delay of up to $Y(|\tau|, B)$.

We define a basic value $\ell_4^{B}(v)$ for each vertex $v$ procedurally, as follows. We start by setting the
initial values of $\ell_4^{B}(v)$ to zero. For each vertex $v$ in the computation dag for a Type 2 task $\tau$, we
increase the initial value of $\ell_4^{B}(v)$ by $Y(|\tau|, B)$. We say that a BP task is complete if it is the task
for the full BP computation as opposed to a subtask. For each vertex $v$ in the computation dag
for a complete BP task $\tau$, we increase the initial value of $\ell_4^{B}(v)$ by $e \min\{B, |\tau|\}$. Note that each
vertex may be part of the subdag for multiple nested tasks. Thus the basic $\ell_4^{B}$ value of a vertex $v$
may be the result of multiple positive increments, one for each Type 2 task $v$ is part of, and one
for the complete BP task $v$ is part of, if any.

We define 4-increments as follows. Each node $v$ originating a Type 2 task or a complete BP
task receives 4-increment equal to $\ell_4^{B}(v)$; all other nodes receive a 4-increment of 0. Next, for each
node $x$ in $D$, we define $\Delta_4(x)$ to be the maximum, over all paths from $x$ to the bottom of $D$, of
the sum of 4-increments along the path, excluding the increment at $x$, if any. Finally, we define
$\ell_4(x) = \ell_4^{B}(x) + \Delta_4(x)$.

Whenever an access to $\beta$ by $\tau$ or its subtasks causes a block miss, $\ell_4$ is decremented for every
node in the subdag $D_\tau$ representing $\tau$'s computation. Thus $\ell_4(u)$ is decremented by the same
amount as a result of accesses to $\beta$ for every node $u$ in $D_\tau$, and this is a total decrement of at most
$Y(|\tau|, B)$.

This accounts for all possible block delays. Accordingly, we redefine $h(v) = \ell_1(v) + \frac{b}{s}[\ell_2(v) +
\ell_3(v) + \ell_4(v)]$. As $\ell_1$ is strictly decreasing on every edge $e = (u,v)$ in $D$, it follows that $h(u) \ge
h(v) + 2$.

We note that the definition of $\ell_4$ can be extended to Type $j$ algorithms, $j > 2$, by using functions
$Y$ appropriate for Type $j$ algorithms.

The following result is straightforward.

Lemma 6.11. For each edge $(u,v)$ in the dag $D$, $\ell_4(u) \ge \ell_4(v) \ge 0$ always.

The bounds for Lemmas 6.7 and 6.8 extend unchanged to the HBP computations, yielding the
following theorem.

Theorem 6.2. Let A be a limited- access, exactly linear space bounded, Type 2 HBP algorithm. 
Also, let h{t) denote the initial level of D 's root t. Then, when scheduled under RWS, for a = u^{l), 
with probability 1 — 2~°-'^°° , for any integer a > 1, A undergoes 0{p ■ h{t){l + a)) successful steals. 
In addition, the time spent by all the processors collectively on steals, successful and unsuccessful,
is $O(p \cdot s[1 + a]h(t))$.

It remains to bound the value of the $\ell_i$ at the root of the dag $D$. For $\ell_1$ and $\ell_3$, the values are
proportional to the length of the longest path in $D$. The next lemma bounds the values of $\ell_2$ and
$\ell_4$.

Lemma 6.12. Let A be a limited- access, top-dominant Type 2 HBP algorithm with S\A) = Q{n). 
Suppose that each recursive call A makes has size at most s{n) < n/b, for some constant b > 1. 
Let c > 1 denote the number of collections of recursive calls made by A. 

Further suppose that A observes the constraints on writes for BP and HBP computations, and 
let A have size n. Then for every node v in its computation dag D, 

£2{v) =0{B Yl c'*^"'""^ + E • S'is^'Hn)) ), 

j<s*(n,i3) i>s*(n,B) 

£4{v)=0{B c^*(".s)+ Y d ■Y{s^^{n),B)), 

i<s*{n,B) i>s*{n,B) 

where s*{n,B) is the least number i of iterations of s such that S^{s^'^\n)) < B. 
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Proof. This is immediate from the recursive structure of the HBP computation. Let r be a task 
corresponding to a recursive computation in A. r adds 0{m.m{S^{\T\), B}) to £2 and 0{Y{\t\, B)) 
to h. U 

Corollary 6.1. For A as specified in Lemma \6.1^ if S\n) = Q{n), 

i2iv),h{v) = o{B c^*^"'""^ + Yl Y."^ ■ *^'^(^) )' 

i<s*{n,B) i>s*{n,B) j>i 

where s*{n,B) is the least number i of iterations of s such that s^^\n) < B. If in addition s{n) < 
n/(bc), where b > 1 is a constant, this is an 0{Bs* {n, B)) bound for c = 1 and an 0{Bc^*^^'^'>) 
bound for c > 1. 
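The iteration count $s^*(n,B)$ is easy to compute directly from its definition as the least number of applications of $s$ that bring the problem size down to at most $B$. The sketch below is illustrative: the function name `s_star` is hypothetical, sizes are treated as integers, and the guard against a non-shrinking `s` is an added safety check rather than part of the definition.

```python
import math


def s_star(n, B, s):
    """Least number i of iterations of s such that s applied i times to n is <= B."""
    i = 0
    while n > B:
        nxt = s(n)
        if nxt >= n:
            raise ValueError("s must strictly reduce the problem size")
        n = nxt
        i += 1
    return i


# Square-root recursion: 65536 -> 256 -> 16 -> 4.
sqrt_case = s_star(65536, 4, math.isqrt)
# Quartering recursion: 1024 -> 256 -> 64 -> 16.
quarter_case = s_star(1024, 16, lambda m: m // 4)
```

For the square-root recursion the count grows like $\log(\log n / \log B)$, while for a constant-factor reduction it grows like $\log(n/B)$, which is the distinction the following theorem draws between its cases.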

Proof. By Lemma |44| for i > s*{n,B), Y {s''^'> (n) , B) = Y.j>i'^~^^^^H''^)- Substituting into the 
bound of Lemma [6.121 yields the result. ■ 

Theorem 6.3. Let A be a limited- access, top dominant Type 2 HBP algorithm with S\n) = G(n). 
Recall that c > 1 denotes the number of collections of recursive calls made by A, and that s{n) is 
a bound on the size of the recursive subproblems called by A. Then, h{t) is bounded as follows. 

(i) c = 1: 0(^7^Too + ^Bs*{r,B)), where s*{n,B) is the number of applications of r needed to 
reduce n to at most B. 

(ii) c = 2and s{n) = ^: 0{'-±^T^ + ^B^). 

(Hi) c = 2 and s{n) = n/4.- Oi^^T^o + ^V^). 

These choices of c and s{n) are the ones that occur in our HBP algorithms [6]. Similar bounds 
are readily obtained for other values of c and r. 

Proof, (i) follows immediately from Corollarv 16.11 For (ii), we have s*{n,B) = log( j°^^ ); on 
substituting in the bound from Corollary 16. 11 the result is immediate; similarly, for (iii), s*fn,B) = 
.J^. ■ 

We now pull everything together to bound the runtime of Type 2 HBP algorithms. 

Theorem 6.4. Let A be a limited- access, top dominant Type 2 HBP algorithm with S''(n) = G(n). 
Let c > 1 denote the number of collections of recursive calls made by A. Suppose that each recursive 
call A makes has size at most s{n) < n/b, for some constant b > 1. Further suppose that A observes 
the constraints on writes for BP and HBP computations as specified in this section, and let A have 
size $n$. Let $W$ be the worst case operation count when $A$ is executed sequentially, $Q$ its worst
case cache miss cost, again when executed sequentially, and let $C(S,n)$ be an upper bound on the
additional cache miss count when $A$ incurs $S$ steals in its execution.

Then, with probability 1 — 2~®('*'^°°), for integer a = U}{1), A's runtime is given by: 

\p p p p 

and S = 0{0{p ■ [1 + a]h{t)), where h{t) = 0{T^ + -M2{t) + U{t)]) . 
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Proof. A^s performance, in terms of operation count and cache miss costs is bounded by the cost 
of a sequential computation plus the additional costs associated with the S steals. There are three 
components to these additional costs: the additional cache misses; the block wait cost, which by 
Lemma 14.41 is 0{BS); and the work performed in making successful and unsuccessful steals, which 
by Theorem [O] is 0{p ■ s[l + a]h{t)) with probability 1 - 2"'*^°°. Further, S = 0{0{p ■ [I + a]h{t)) 
with probability 1 - 2""'^°°, again by TheoremlOl Finally, h{t) = + ^[hit) + + h{t)] = 

o(T^ + l[i2it)+um. u 

Note that Lemma 6.12, Corollary 6.1 and Theorem 6.3 provide bounds on $\ell_2(t)$ and $\ell_4(t)$.

Corollary 6.2. Under the conditions of Theorem 6.4, if $s = \Theta(b)$ and $C(S,n) + S \cdot B = O(Q)$
then the execution of $A$ under RWS using $p$ processors achieves an optimal $\Theta(p)$ speedup compared
to the sequential execution.



7 Algorithms Runtimes 

Bounds on the runtimes of the MM algorithms discussed earlier can be readily deduced using
Corollaries 3.1 and 3.2, Theorems 6.2 and 6.3, and Corollary 6.2, as shown in the next lemma.

Lemma 7.1. For each of the MM algorithms, let $S$ be the number of steals it incurs. In a sequential
execution, each MM algorithm incurs $O(n^3/(BM^{1/2}))$ cache misses. Under RWS, the steals cause
an additional $C(S,n) = O(n^3/(BM^{1/2}) + S^{1/3} n^2/B + S)$ cache misses. The block delay is $O(S \cdot B)$.
This is optimal if $s = \Theta(b)$ and $S \cdot M^{1/2} \max\{B^2, M\} \le n^3$, i.e. the runtime is

$O\left(\frac{1}{p}\left[n^3 + b \cdot \frac{n^3}{B \cdot M^{1/2}}\right]\right)$.

The depth n matrix multiply algorithm, with local arrays for holding partial results, incurs, with 
probability 1 - 1/2''", for a = oj{l), S = 0{[^pn + ^pn/B][l + a\) successful steals. With a = 1 
and s = G(6), this is optimal if p < /{B^/^M^/^) and M > B^. 

The depth log^ n matrix multiply algorithm incurs, with probability 1 — 1/2'^^°^ for a = U}{1), 
S = 0{[^^plog^ n + ^pi?logn][l + a]) steals. With a = 1 and s = Q{b), this is optimal if 
p(log^ n + Blogn) < and M > B"^. 

Proof. The bound on $C(S,n)$ is given in Corollaries 3.1 and 3.2. By Corollary 6.2, this is optimal
if (i) $S \cdot B \le n^3/(BM^{1/2})$, i.e. $S \cdot B^2 M^{1/2} \le n^3$, and (ii) $S^{1/3} n^2/B \le n^3/(BM^{1/2})$, i.e. $S \cdot M^{3/2} \le n^3$.
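The two conditions in this proof combine into the single condition $S \cdot M^{1/2} \max\{B^2, M\} \le n^3$ used in the lemma statement. The sketch below checks that equivalence numerically. It is only a sanity check under assumptions: the function names are hypothetical, all parameters are positive integers, and each inequality is squared so that the square root of $M$ never has to be evaluated in floating point.

```python
def conditions_separately(S, n, B, M):
    """(i) S * B^2 * M^(1/2) <= n^3 and (ii) S * M^(3/2) <= n^3, both squared."""
    cond_i = (S * B * B) ** 2 * M <= n ** 6
    cond_ii = S * S * M ** 3 <= n ** 6
    return cond_i and cond_ii


def condition_combined(S, n, B, M):
    """S * M^(1/2) * max{B^2, M} <= n^3, squared."""
    return (S * max(B * B, M)) ** 2 * M <= n ** 6


# Exhaustive agreement on a small grid of made-up parameter values.
agree = all(
    conditions_separately(S, n, B, M) == condition_combined(S, n, B, M)
    for S in range(1, 40)
    for n in (4, 8, 16)
    for B in (1, 2, 4)
    for M in (1, 4, 16, 64)
)
```

For example, with $n = 64$, $B = 4$ and $M = 64$ the combined condition permits at most 512 steals.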



The depth n algorithm has Too = 0(n). By Theorem 16. 3| h{t) = 0{n + ^n^/B), as for this 
algorithm s(n^) = n^/4 and c = 2. Thus, by Theorem 16.21 with probability 1 — 1/2'*", for a = w(l), 
the depth n algorithm incurs S = 0{[^^pn+^pny/B][\+a]) successful steals. If a = 1 and s = 9(6), 
this is 0{pn\fB) steals. On substituting for S, we see that this is optimal if pnVS < and 
B^ < M. 

The depth log^ n algorithm has Too = O(log^n). By Theorem 16.31 ^(*) = 0(log^n + ^Slogn), 
as for this algorithm s(n^) = n^/4 but c = 1. Thus, by Theorem [621 with probability 1 — 1/2**^°^^ ", 
for a = oj{l), the depth log^ n algorithm incurs S = 0{[^^plog^ n + ^pi?logn][l + a]) successful 
steals. If a = 1 and s = 9(6), this is 0(plogn[logn + i?]) steals. On substituting for S, we see that 
this is optimal if p(log^ n + Blogn) < -jj^ and B"^ < M. ■ 
Next, we state bounds for the following additional algorithms: matrix transpose (when in BI
format), FFT [6] and sorting [7]. (The FFT algorithm treats the data as being in a 2-D matrix, and
repeatedly transposes suitable submatrices. Again we assume the matrix is in BI format; as this
algorithm has $T_\infty = O(\log n \log\log n)$, a different algorithm for the BI to RM conversion is needed.
It is given in our companion paper [6]. Again, its costs are dominated in all regards by those for
the FFT algorithm.)

Theorem 7.1. The following algorithms have the stated runtimes with probability 1 — 2"®^'^'^°°), 
for integer a = a;(l), where W is their operation count, Q their cache miss count when executed 
sequentially, and C{S,n) their additional cache miss count when they incur S steals. 

\P P P P J 

(i) BP algorithms {e.g. prefix sums). 

W = 0{n), Q = 0{n/B), = O(logn), S = 0{p{^ log n + \B) {I + a)), C{S,n) = 0{S). 
Assuming that $s = \Theta(b)$ and $a = O(1)$, this has combined cache and block miss costs of the same
magnitude as the sequential cache miss cost when $pB(\log n + B) \le n$.

(ii) Matrix transpose, RM to BI conversion.

The bounds from (i) apply, with $n^2$ replacing $n$, as this is a BP algorithm. (For matrix transpose,
this is assuming that the matrix is in BI format. Possible complementary BI to RM algorithms are
discussed in [6].)

(iii) Sort: See [7] for a description of this algorithm.

W = 0(n log n), Q = /ogM ) ^.ssuming that M > B^ [the "tall cache assumption"). Too = 
O(lognloglogn), S = 0(p(^ lognloglogn + fi?g^)(l + a)), C{S,n) = 0{2^^^) where 

~ ^( j/i/2j loglj )^ f^''" ■^o'Tie integer j > 1. Assuming that s = Q{b) and a = 0(1), this has 
combined cache and block miss costs of the same magnitude as the sequential cache miss cost when 
pB{lognlog\ogn + B^) < min{f g^, -^g^}, and j = 0(1), i.e. pi? log M (log log n + 

S/log5)max{5,MV2^} < n. 

(iv) FFT: See [6] for a description of this algorithm. The same bounds as for sorting apply.

Proof. The main bound as a function of $S$ is given by Theorem 6.4.

We explain the sorting bounds in more detail. A task of size $r$ incurs $O(r/B + \sqrt{r}) = O(r/B + B)$
cache misses in this algorithm. As the block miss cost of a stolen task is bounded by $O(B)$, the
second term in the cache miss bound is absorbed into the block miss cost for the purposes of the
analysis.

Here, there are collections of recursive problems of sizes $n$, $\sqrt{n}$, $n^{1/4}$, etc. There are 2 collections
of the problems of size $\sqrt{n}$, 4 of those of size $n^{1/4}$, etc. For each collection, there are $\Theta(n/B)$ cache
misses, when the problems are of size $2M$ or larger. For subproblems of size $M$ or smaller, steals
may result in another $O(n/B)$ cache misses for each full collection of subproblems. As usual, the
worst case arises if the largest possible size subproblems are stolen. This occurs with j chosen as 
small as possible so that S = 0{-^j^^^). 

The block miss cost is $O(S \cdot B)$. This yields the two bounds on $p$ in the final result. ■
The list ranking algorithm in [6] iterates a sorting algorithm $O(\log n)$ times, so its bounds are
no more than $O(\log n)$ times the bounds for the sort given above (in fact, somewhat better bounds
can be obtained, as the combined size of the sorting problems is $O(n \log n)$).

The connected components algorithm in [6] iterates the list ranking algorithm $O(\log n)$ times,
so its bounds are no more than $O(\log n)$ times the bounds for the list ranking algorithm, but the
sizes of each successive list ranking problem are the same.